Abhishek Gautam

Binary Quantization: the 1-bit trick that turns terabytes of vectors into pocket-sized fingerprints

“If you can’t explain it with a single sign bit, you probably don’t understand it yet.” — a very anonymous engineer 😜

🧭 1. Why you’re here – the memory wall

You already pip install pgvector, CREATE EXTENSION vector, and happily insert 1024-D OpenAI embeddings as vector(1024) rows.
At 32-bit float precision, 1 M vectors × 1024 dims × 4 B ≈ 4 GB.
At 100 M vectors that's 400 GB – far more than a single m7g.8xlarge instance (128 GiB of RAM) can hold, let alone the index built on top of it.
Binary Quantization keeps only the sign of every dimension (+1 or –1) plus the original L2 norm.
The same 100 M vectors shrink to ≈ 12.8 GB of sign bits + 0.4 GB of norms – roughly 30× smaller overall (32× on the vectors themselves) – while recall drops only 2–4 % after a cheap re-ranking step.

In this article, we'll:

  1. Ground ourselves in the distance measures we will use.
  2. Unpack the angular (geometric) intuition behind the binary codes.
  3. Show how to implement binary quantized indexes in PostgreSQL's pgvector.
  4. Walk through full precision vs binary quantized search both with and without re-ranking.

Quick detour: Hamming distance & L2 distance

1️⃣ Hamming Distance

  • What are we comparing? Two equal‑length bitstrings, e.g.
  u = 1 0 1 1 0 1  
  v = 1 1 0 1 0 0
  • Game rule: Count how many positions have different bits.

    • Position 1: 1 vs 1 → same
    • Position 2: 0 vs 1 → different
    • Position 3: 1 vs 0 → different
    • Position 4: 1 vs 1 → same
    • Position 5: 0 vs 0 → same
    • Position 6: 1 vs 0 → different
  • Hamming distance = total “spots” that differ.

    • Here: differences at positions 2, 3, 6 ⇒ Hamming = 3.
  • Why It Matters: Once vectors are bits, Hamming distance (computed via XOR+popcount) gives a lightning‑fast proxy for angular closeness.
    Analogy: Spot‑the‑Difference in two pictures—each mismatch is a “hit” on your scorecard.
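You can try this directly in PostgreSQL on bit strings; a minimal sketch, assuming PostgreSQL 14+ for bit_count (# is bitwise XOR):

   -- XOR flags the differing positions, popcount totals them
   SELECT bit_count(B'101101' # B'110100') AS hamming;  -- returns 3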

2️⃣ L₂ Distance (Euclidean Distance)

  • What are we comparing? Two real‑valued vectors, e.g.
  x = [3, –1, 2]  
  y = [0,  2, 1]
  • Game rule: Imagine each vector as a point in 3‑D space. The L₂ distance is the length of the straight line joining x and y.

$$
d_2(x,y) = \sqrt{(3-0)^2 + (-1-2)^2 + (2-1)^2}
         = \sqrt{3^2 + (-3)^2 + 1^2}
         = \sqrt{9 + 9 + 1}
         = \sqrt{19} \approx 4.36.
$$

Analogy: The shortest path between two cities on a flat map.
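If pgvector is installed, the same number falls out of the <-> (L2 distance) operator; a quick sanity check:

   SELECT '[3,-1,2]'::vector <-> '[0,2,1]'::vector AS l2;  -- ≈ 4.3589 = sqrt(19)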

🔗 Why Both Matter for Quantization

  • Hamming distance gives a binary proxy for “angle” or “similarity” when you compress vectors to bits.
  • L₂ distance (and its cousin, cosine similarity) is the gold standard for comparing the original float vectors.

In our binary‑quantization workflow, we’ll use Hamming as the fast filter, then L₂ (or cosine) on the original floats to refine the final result.


📚 2. A gentle-to-deep walkthrough of Binary Quantization

✅ 2.1 Beginner View (What’s the trick?)

Let’s say you have a vector:

x = [3.2, -0.4, 7.1, 0.0, -2.5, ...]

This is a vector of real numbers (say, length 1024), like you'd get from an embedding (e.g., OpenAI, BERT, etc.).

Now imagine you want to store millions of these — the memory adds up FAST. So here’s a storage trick:

👉 Step-by-step idea:

  1. Throw away the exact values, just keep the sign:
  • Positive → +
  • Negative → –
  • (Usually zero is treated as positive.)

So we get:

   sign(x) = [+, –, +, +, –, ...]

  2. Encode the signs as bits:
  • + → 1
  • – → 0

So now this vector becomes a bit string:

   10110...

  3. Store the original magnitude (optional):
  • Compute the length of the original vector, called its L2 norm: ||x||₂
  • Store this as a single float (4 bytes).

This means instead of storing 1024 floats (1024 × 4 bytes = 4 KB), you store:

  • 1024 bits = 128 bytes
  • Plus one float (magnitude) = 4 bytes

Total: ~132 bytes instead of 4096 bytes! 🎉
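pgvector (0.7.0 and later) can produce both pieces for you; a minimal sketch on the toy vector above (exactly how the 0.0 element is rounded depends on the implementation):

   SELECT binary_quantize('[3.2,-0.4,7.1,0,-2.5]'::vector) AS sign_bits,  -- one bit per dimension
          vector_norm('[3.2,-0.4,7.1,0,-2.5]'::vector)     AS l2_norm;    -- ≈ 8.19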


🧠 2.2 Intermediate View (Why is this useful?)

Even though you’ve thrown away the actual values, you still want to do things like compare vectors (e.g., using cosine similarity or dot products).

So how does comparing just the signs work?

Let’s define:

  • b(x) = binary version of x, where each element is +1 or –1 depending on the sign
  x = [ 3.2, -0.4, 7.1 ] → b(x) = [ +1, –1, +1 ]

Now if you take two binary vectors b(x) and b(y), their dot product (i.e. sum of element-wise products) can be expressed in terms of Hamming distance:

Formula:

b(x) · b(y) = (# of matching signs) – (# of differing signs)
            = d – 2 × Hamming(b(x), b(y))

Where:

  • d is the number of elements (e.g., 1024)
  • Hamming distance = number of positions where the bits differ
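A quick check on the 3-dimensional example above, inventing b(y) = [+1, +1, +1] (bits 111) as the second vector:

   -- d = 3, Hamming = 1, so the signed dot product should be 3 - 2*1 = 1
   SELECT 3 - 2 * bit_count(B'101' # B'111') AS signed_dot;  -- returns 1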

What does this give us?

It gives you an approximate similarity score:

  • Small Hamming distance → more similar
  • Large Hamming distance → more different

And what about cosine similarity?

Cosine similarity is defined as:

cos(θ) = (x · y) / (||x|| * ||y||)

Since we stored the signs (b(x)), and separately stored the magnitude (||x||), we can roughly approximate cosine similarity by:

  1. Using Hamming distance to filter similar vectors
  2. Optionally recovering a more accurate similarity in a second step (the re-ranking pass)

🔬 2.3 Advanced View (Why this approximation works surprisingly well)

Let’s assume x and y are random unit vectors (i.e., their length is 1). Then some deep math shows:

The expected per-dimension agreement of their sign vectors is:

E[b(x) · b(y)] / d = (2/π) × arcsin(cos(θ))

What this means:

  • Even though we only stored the signs (i.e. +1/–1), the dot product still tracks the original cosine similarity quite well.
  • So binary dot product ≈ arcsin of cosine similarity
  • We can even invert this if needed (sketched below).
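Combining the two identities (b(x)·b(y) = d − 2h and the arcsin relation) gives cos θ ≈ sin(π/2 × (d − 2h)/d). A sketch of that inversion, with a made-up Hamming distance of h = 300 over d = 1024 bits:

   SELECT sin(pi()/2 * (1024 - 2*300) / 1024.0) AS est_cosine;  -- ≈ 0.61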

In practice:

  • People often skip the arcsin (for speed), and use Hamming distance as a fast approximation.
  • Then, on the top k closest vectors (say, top 1000), we compute the exact cosine using original vectors.

This is called a two-stage retrieval:

  1. Fast filter: binary Hamming distance
  2. Slow rerank: exact cosine similarity

Every metric you will ever need, explained

Metric      | What it really means                        | How to measure (pgvector 0.8.0) | Good target
Index size  | RAM needed to keep the HNSW graph in memory | pg_relation_size('idx_name')    | < 30 % of the float index
Build time  | CREATE INDEX wall-clock time                | psql \timing                    | roughly linear in ef_construction
QPS         | Queries per second under steady load        | pgbench -P 1 -T 60              | goes up with smaller vectors
p99 latency | 99 % of queries finish below this           | EXPLAIN (ANALYZE, BUFFERS)      | < 5 ms for chat UX
Recall@k    | % of the true top-k returned                | ANN-Benchmarks                  | ≥ 90 % for RAG
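For instance, the first two rows can be checked straight from psql against whatever idx_name you built (docs_vec_hnsw below is the index defined later in this article):

   \timing on                                                  -- report wall-clock time for CREATE INDEX / queries
   SELECT pg_size_pretty(pg_relation_size('docs_vec_hnsw'));   -- on-disk size of the HNSW index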

Which Index to Use with Which Metric

Choosing the right index is like picking the right vehicle for your road trip. 🚗🏍️🚐

Index type  | Best metric | When to use
FLAT        | L₂ / Cosine | Brute-force exact search. Ideal for small datasets or one-off analyses where speed isn't critical.
IVF         | L₂ / Cosine | "Partition your vectors into Voronoi cells": good for medium-to-large data. Tweak nlist (clusters) & nprobe (cells to search) for speed vs. accuracy.
HNSW        | L₂ / Cosine | Graph-based, super low-latency, high-recall. Go-to for real-time apps (recommendations, search engines).
Binary-HNSW | Hamming     | Compressed graph on bit codes: lightweight, blazing Hamming ops for the initial filter; rerank with full floats.

IVF is like speed‑dating your vectors—quickly cluster into small groups. HNSW? More like a LinkedIn network—you hop graph‑links. FLAT? Well, that’s a group hug: you compare everyone to everyone. 🙃
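In pgvector, the IVF knobs go by lists (set at index build time) and ivfflat.probes (set per session); a minimal sketch, assuming an items table with an embedding vector column:

   -- Build an IVFFlat index with 100 clusters (≈ nlist)
   CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

   -- Search 10 of those clusters per query (≈ nprobe)
   SET ivfflat.probes = 10;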


SQL Schema

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
    id        bigserial PRIMARY KEY,
    title     text,
    body      text,
    full_vec  vector(1536),          -- original for re-rank
    sign_bits bit(1536),             -- 1-bit signature (192 bytes)
    vec_norm  real                   -- 4-byte scalar
);

-- Binary HNSW on the sign bits for fast Hamming-distance filtering (pgvector >= 0.7)
CREATE INDEX docs_sign_hnsw ON docs USING hnsw (sign_bits bit_hamming_ops);

-- HNSW on original vector (for ANN baseline)
CREATE INDEX docs_vec_hnsw ON docs USING hnsw (full_vec vector_cosine_ops)
WITH (m=24, ef_construction=256);
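To populate sign_bits and vec_norm, derive them from the full vector at insert time; a sketch, with $1 standing in for the 1536-dimensional embedding:

   INSERT INTO docs (title, body, full_vec, sign_bits, vec_norm)
   VALUES (
       'some title',
       'some body',
       $1,
       binary_quantize($1::vector(1536)),   -- 1 bit per dimension
       vector_norm($1::vector(1536))        -- original L2 norm, kept for later refinement
   );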

Two‑Phase Search: ANN + KNN Hybrid in SQL

Imagine a bouncer at a club who first does a quick glance (ANN filter) and then a proper ID check (exact KNN). In SQL:

-- Phase 1: coarse ANN filter via Hamming distance on the stored sign bits
WITH candidates AS (
  SELECT id
  FROM docs
  ORDER BY sign_bits <~> binary_quantize($1)  -- Hamming distance, served by the bit HNSW index
  LIMIT 200                                   -- oversample (ef_search-style), then rerank
)
-- Phase 2: precise rerank via exact KNN (cosine here; use <-> for L2)
SELECT d.id
FROM candidates c
JOIN docs d ON d.id = c.id
ORDER BY d.full_vec <=> $1                    -- exact distance on the original floats
LIMIT 10;                                     -- final k
  1. candidates: fast bit‑ops over compressed codes.
  2. rerank: join back to full embeddings for exact sorting.

This hybrid gives the best of both worlds: speed + accuracy.
The coarse pass works only on the tiny sign-bit signatures (cheap even across 1 M rows); the rerank then touches only on the order of 1000 surviving full vectors.


When to Use Binary Quantization in RAG

Retrieval‑Augmented Generation (RAG) pipelines often embed documents and user queries, then fetch top‑k for context injection. When should you slice those embeddings down to bits?

Scenario                                   | Use binary quantization?
Massive document corpora (100M+ vectors)   | ✅ Yes: storage & memory are at a premium. Use BQ + rerank.
Low-latency chatbots (sub-100 ms targets)  | ✅ Yes: a Hamming comparison costs nanoseconds.
Small knowledge bases (< 100k docs)        | ❌ Probably not: FLAT or IVF with scalar quantization suffices.
Fine-grained accuracy (e.g., legal texts)  | ❌ No: a single bit per dimension may lose nuance.

Pro Tip: Azure AI Search reports up to 96 % index size savings and 40 % query latency reduction, while regaining recall via oversampling + reranking.


Experimental Highlights

  • Space Savings: Up to 19× smaller index for 960‑D vectors; 96 % smaller on 1536‑D (Azure AI Search benchmarks).
  • Build‑Time Speedup: 2×–4.5× faster indexing on large dimensions.
  • Recall Trade‑off: Without rerank, recall can plunge to single digits on low‑diversity sets; rerank recovers >90 % on high‑diversity corpora.
  • Throughput Gains: 1.3×–2× QPS boost; 25–30 % p99 latency drop when reranking at moderate ef_search.

Key Takeaways & Recommendations

  1. Pick your index wisely: FLAT for small data; IVF for balanced scale; HNSW for real-time; Binary-HNSW for an ultra-light filter.
  2. Always rerank: binary quantization without rerank is like firing rubber bullets, fast but often inaccurate.
  3. Measure bit diversity: high-dimensional, varied vectors fare best. If recall lags, raise ef_search (see the one-liner below) or bump the rerank size.
  4. Optimize costs: smaller indexes fit on cheaper instances, a big win for RAG at scale.
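In pgvector the relevant knob for item 3 is the hnsw.ef_search setting (default 40):

   SET hnsw.ef_search = 200;  -- more candidates examined: better recall, slower queries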

Future Directions

  • Half-Precision + Binary: quantize floats to 16-bit, then to 1-bit; dual compression!
  • SIMD & AVX-512: PostgreSQL 17 adds AVX-512 popcount support, speeding up the bit counting behind Hamming distance; a speed geek's dream.
  • Jaccard vs. Hamming: Evaluate bitset vs. set‑based distances in pgvector.
  • Billion‑Scale Benchmarks: How does recall hold up at 1 B vectors?
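For the first bullet, pgvector ≥ 0.7 already ships a 16-bit float type; a sketch of what the half-precision leg could look like (docs_half is a hypothetical table):

   CREATE TABLE docs_half (
       id       bigserial PRIMARY KEY,
       half_vec halfvec(1536)              -- 2 bytes per dimension instead of 4
   );

   CREATE INDEX docs_half_hnsw ON docs_half USING hnsw (half_vec halfvec_cosine_ops);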

Conclusion

Binary quantization is not a silver bullet, but in the hands of a discerning engineer it's a sports car on the right road. Pair it with the right index, a two-phase hybrid search, and reranking, and you'll tame even the wildest embedding herds without sacrificing recall or breaking the bank.

Happy vector hunting! 🎯


References
  • Qdrant: Binary Quantization - Vector Search, 40x Faster. https://qdrant.tech/articles/binary-quantization/
  • Pinecone: Nearest Neighbor Indexes for Similarity Search. https://www.pinecone.io/learn/series/faiss/vector-indexes/
  • Macrometa: Index Type. https://www.macrometa.com/docs/search-views/semantic-search/concepts/index-type
  • Razvi Noorul (Medium): Understanding Index Types in Vector Databases: When and Why to Use Them. https://medium.com/%40noorulrazvi/understanding-index-types-in-vector-databases-when-and-why-to-use-them-46ac9a559994
  • Microsoft Tech Community: Binary quantization in Azure AI Search: optimized storage and faster search. https://techcommunity.microsoft.com/blog/azure-ai-services-blog/binary-quantization-in-azure-ai-search-optimized-storage-and-faster-search/4221918
