DEV Community

Cover image for Vector Databases for Data Engineers: Pinecone vs Weaviate vs Qdrant vs pgvector
Gowtham Potureddi
Gowtham Potureddi

Posted on

Vector Databases for Data Engineers: Pinecone vs Weaviate vs Qdrant vs pgvector

vector database is the phrase data engineers hear in every roadmap meeting now that retrieval-augmented generation has moved from a hackathon demo to a production line item — and it is also the phrase that hides the most expensive design decisions of the last five years. The choice of store, index, sharding scheme, and embedding-model upgrade path determines whether a retrieval-heavy product responds in 40 milliseconds at 100 million vectors or chokes a single replica into a one-second tail at 10 million.

This guide is the comparison you wished existed the first time a product manager asked you "should we just use pgvector?" and the answer was longer than a Slack reply. It walks the three real choices — what a vector database actually does, where it sits in the platform, and the four-vendor matrix (pinecone, weaviate, qdrant, pgvector) — then drops into the index-type ladder (hnsw, ivfflat, scalar / product quantization, DiskANN) and the ops surface (memory sizing, multi-tenancy, drift, reindex). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for a vector database tutorial — bold white headline 'Vector Databases' with subtitle 'Pinecone · Weaviate · Qdrant · pgvector · ANN · HNSW' and a stylised constellation of glowing embedding-point orbs orbiting a central index hexagon on a dark gradient with purple, green, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the database design practice library →, rehearse on indexing problems →, and stack the storage-design muscles with system design drills →.


On this page


1. What a vector database actually is (and what it isn't)

A vector database stores high-dimensional embeddings and supports approximate nearest neighbour search — and that is its entire job

The one-sentence invariant: a vector database is a specialised store whose primary access pattern is "find the K rows whose embedding vectors are closest to this query vector," using an approximate-nearest-neighbour index that trades a few percent of recall for orders-of-magnitude speedup over exact KNN. Once you internalise that one sentence, the four vendor matrices, the index-type alphabet soup, and the "do I need this?" decision collapse into a sequence of clean engineering trade-offs.

The two core workloads.

  • Pure similarity search. Given a query vector, return the top-K nearest vectors by cosine, dot product, or L2 distance. The classic retrieval-augmented generation lookup: "embed the user question, find the K closest document chunks, send them to the LLM."
  • Metadata-filtered retrieval. Return the top-K nearest vectors that also satisfy a structured filter — tenant_id = 7 AND lang = 'en' AND published_at > '2026-01-01'. The filter must compose with the ANN index without destroying recall; how a vendor implements this is the single biggest performance differentiator.

Why exact KNN does not scale.

  • Exact K-nearest-neighbour search on N vectors of dimension d costs O(N · d) per query — every vector compared, every dimension touched. At N = 10M and d = 1536 (OpenAI text-embedding-3-small), one query is ~15 billion floating-point operations. A single CPU core takes seconds.
  • Approximate nearest neighbour (ANN) indexes — HNSW, IVFFlat, DiskANN — trade a small recall loss (usually 95–99 percent vs the brute-force baseline) for sub-linear query cost, often O(log N) or sub-linear with quantization. The same query on the same data drops to single-digit milliseconds.
  • The interview-grade statement: "Exact KNN is O(N · d); ANN is sub-linear at the cost of a few percent recall. In production you almost always pick ANN."

Where embeddings come from (out of scope) vs how they are indexed (in scope).

  • Out of scope for this post. The embedding producer — the encoder model (text-embedding-3-small, bge-large, e5-mistral), the batching service, the chunking strategy, the backfill harness. These belong in a separate "embedding pipeline" doc.
  • In scope for this post. The store that receives those vectors, indexes them with HNSW / IVFFlat / DiskANN, filters by structured metadata, replicates them for read scale-out, and serves them back to the application at p99 latency under the SLO.

What a vector database is NOT.

  • It is not a replacement for an OLTP database. Vector DBs are write-heavy at ingestion time and read-heavy at query time, but they do not provide ACID transactions across multiple keys, foreign-key constraints, or complex SQL aggregation. Your orders, users, payments tables stay in Postgres / MySQL.
  • It is not a replacement for an analytics warehouse. A vector store cannot run GROUP BY user_id over a billion rows the way Snowflake or BigQuery can. The query model is "top-K by similarity," not "aggregate by dimension."
  • It is not a search engine on its own. Best-in-class retrieval blends BM25 keyword scoring (Elasticsearch / OpenSearch / Tantivy) with vector similarity. A vector DB handles the dense half; the keyword half lives elsewhere or in a sibling index.

The 2026 reality.

  • pgvector is good enough up to ~10M vectors on a single Postgres replica, and that covers the long tail of products. The "do I even need a dedicated vector DB?" answer is usually "not yet."
  • Pinecone, Weaviate, Qdrant are the dedicated serverless / OSS leaders. Each targets a different operational profile — fully managed, modular OSS, low-latency Rust.
  • HNSW is the default low-latency index across every vendor. IVFFlat is the cheap-memory option. DiskANN is the "I need 100 million vectors on commodity SSD" option.
  • Hybrid retrieval (vector + BM25 + metadata filter) is now the assumed baseline; pure vector search alone underperforms keyword search on exact-match queries by 10–30 percent.

Worked example — the recall-vs-latency knob on HNSW

Detailed explanation. Most engineers meet ANN parameters by tweaking them blind. The clearest mental model is to think of HNSW's ef_search as a "how many candidates do you want to inspect before returning K?" knob: high ef_search gives high recall and high latency; low ef_search gives low recall and low latency. The sweet spot is workload-dependent — but the curve has a well-known shape.

Question. Given 5 million 768-dimensional embeddings indexed with HNSW (M=16, ef_construction=200), describe what happens to recall and p99 latency as you sweep ef_search from 16 to 256. Pick a value for a customer-facing search box with a 100 ms SLO and a "recall must be ≥ 95 percent" requirement.

Input.

ef_search observed recall@10 observed p99 latency
16 81% 4 ms
32 89% 7 ms
64 95% 12 ms
128 98% 22 ms
256 99.3% 45 ms

Code.

# Sweep ef_search and measure recall + p99 against a brute-force baseline.
from time import perf_counter
import numpy as np
import hnswlib

def benchmark(index, queries, gold, ef_values):
    rows = []
    for ef in ef_values:
        index.set_ef(ef)
        latencies = []
        hits = 0
        for q, truth in zip(queries, gold):
            t0 = perf_counter()
            ids, _ = index.knn_query(q, k=10)
            latencies.append((perf_counter() - t0) * 1000)
            hits += len(set(ids[0]) & set(truth)) / 10
        p99 = np.percentile(latencies, 99)
        recall = hits / len(queries)
        rows.append((ef, recall, p99))
    return rows
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. HNSW builds a small-world graph at index time. M=16 controls the average out-degree per node; ef_construction=200 controls how many candidates are considered when inserting a new node.
  2. At query time, ef_search controls how many candidates the search keeps in its priority queue. The search greedily explores neighbours of the closest known candidate, expanding until the candidate set has fewer improvements than ef_search allows.
  3. With ef_search=16, the search stops quickly — it finds plausible neighbours but misses distant-but-close ones. Recall is 81 percent — every fifth result is wrong.
  4. With ef_search=64, the search visits enough candidates to land on the true top-10 with 95 percent probability. Latency stays under the 100 ms SLO with a comfortable margin.
  5. Pushing ef_search to 256 pays a 4× latency cost for a 4-point recall gain. The marginal value is poor at this scale — the model error (the embedding may not perfectly represent semantic similarity) already dwarfs the 4-point ANN error.

Output.

Setting Recall@10 p99 latency Meets SLO?
ef_search=16 81% 4 ms recall fails
ef_search=32 89% 7 ms recall fails
ef_search=64 95% 12 ms yes
ef_search=128 98% 22 ms yes (overkill)
ef_search=256 99.3% 45 ms yes (overkill)

Rule of thumb. Sweep ef_search on your own data and your own embedding model — the curve shifts with both. Pick the smallest ef_search that crosses your recall floor, then leave 30–50 percent latency headroom for traffic spikes.

Worked example — exact KNN vs ANN at 10 million vectors

Detailed explanation. The single most convincing way to internalise why ANN matters is to compute the brute-force cost in your head, then compare it with the ANN cost. The asymmetry is so large that the "do I need an ANN index?" question has only one answer above a few hundred thousand vectors.

Question. Given 10 million 1536-dimensional vectors, estimate (a) the query cost of exact KNN on a single CPU core, (b) the query cost of HNSW on the same hardware, and (c) the recall trade-off. Show why exact KNN is not a serious option above 1 million vectors.

Input.

Quantity Value
N (vectors) 10,000,000
d (dimension) 1,536
Single-core float ops per second ~10 billion

Code.

# Back-of-envelope: exact KNN vs HNSW at scale.
N = 10_000_000
d = 1_536
single_core_flops = 10_000_000_000  # ~10 GFLOPS for AVX2 float32 dot product

# Exact KNN: N * d float multiplies per query
exact_ops_per_query = N * d
exact_seconds = exact_ops_per_query / single_core_flops

# HNSW: typically visits ~log(N) * ef_search nodes,
# each costing d operations for a distance computation.
import math
ef_search = 64
hnsw_nodes_visited = int(math.log2(N) * ef_search)  # ~1500
hnsw_ops_per_query = hnsw_nodes_visited * d
hnsw_ms = (hnsw_ops_per_query / single_core_flops) * 1000
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Exact KNN compares the query to every one of 10 million vectors, each comparison costing 1,536 floating-point multiplies — 1.5 × 10¹⁰ ops per query.
  2. At 10 GFLOPS, that is 1.5 × 10¹⁰ / 10¹⁰ = 1.5 seconds per query on a single core. Even with 32 cores, you do not get below ~50 ms — and you have spent the whole core budget on one query.
  3. HNSW visits ~log₂(N) · ef_search candidate nodes — about 23 · 64 ≈ 1,500 nodes per query at this scale. Each candidate costs 1,536 ops for a distance computation. Total: 2.3 × 10⁶ ops per query.
  4. At 10 GFLOPS, that is 2.3 × 10⁶ / 10¹⁰ = 0.23 milliseconds of pure compute. Add memory latency, graph traversal overhead, and JSON serialisation — real-world p99 is typically 5–15 milliseconds.
  5. The ratio: exact KNN is ~6,500× slower than HNSW at 10 million vectors. The recall cost of HNSW is 1–5 percent. The decision writes itself.

Output.

Method Ops per query Single-core time Recall@10
Exact KNN 1.5 × 10¹⁰ 1,500 ms 100%
HNSW (ef_search=64) 2.3 × 10⁶ 0.23 ms (pure compute) ~95–99%
Ratio ~6,500× ~6,500× -1 to -5 pts

Rule of thumb. Above ~100,000 vectors, always reach for ANN. Below that, exact KNN inside Postgres / NumPy is fine — and avoids an entire piece of infrastructure. The 100K threshold is the "do I need this at all?" cliff.

Worked example — cosine similarity vs dot product vs L2 distance

Detailed explanation. Every vector database supports at least three similarity metrics: cosine similarity, dot product, and L2 (Euclidean) distance. They are not interchangeable — the choice affects both correctness and how you store vectors. Most teams pick wrong on day one and spend two weeks debugging "why does retrieval feel off?"

Question. Given a set of normalised embeddings (unit vectors), explain why cosine similarity and dot product return the same ranking but different absolute scores, and explain when L2 distance gives the same ranking as cosine.

Input.

Vector A Vector B dot(A, B) cosine(A, B) L2(A, B)
[1, 0] [1, 0] 1.0 1.0 0.0
[1, 0] [0.6, 0.8] 0.6 0.6 0.89
[1, 0] [0, 1] 0.0 0.0 1.41
[1, 0] [-1, 0] -1.0 -1.0 2.0

Code.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2(a, b):
    return np.linalg.norm(a - b)

# If both vectors are L2-normalised:
# - cosine(a, b) == dot(a, b)
# - l2(a, b)**2 == 2 - 2 * dot(a, b)
# Therefore cosine ranking and L2 ranking are identical on normalised vectors.

a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])
print(cosine(a, b), np.dot(a, b), l2(a, b))   # 0.6, 0.6, 0.894
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For unit-length vectors, cosine similarity is dot(a, b) / (1 · 1) = dot(a, b). So cosine and dot product return the same values and the same ranking.
  2. For unnormalised vectors, dot product rewards longer vectors — a longer vector beats a shorter one on dot product even if the angle between them and the query is identical. This is why naked dot product surfaces "popular" documents in production unless you normalise first.
  3. L2 distance and cosine similarity rank the same set of unit vectors identically. Proof: L2(a, b)² = 2 - 2·cos(a, b), which is strictly monotonic decreasing in cosine. Lower L2 ⇔ higher cosine.
  4. If embeddings are not normalised, L2 and cosine can disagree. Always inspect the embedding model's documentation: most modern encoders (OpenAI's text-embedding-3-*, BGE, E5) return unit-length vectors already.
  5. Pick cosine for text similarity; pick dot product when you have learned weights that intentionally encode "importance" via length (rare); pick L2 when the embedding model documents it (image embeddings sometimes use L2). Cosine is the safe default.

Output.

Embedding type Best metric Why
Unit-normalised text cosine (or equivalently dot) semantic angle; length normalised away
Unnormalised text normalise then cosine avoid length bias
Image (CLIP) cosine model trained with cosine loss
Trained ranking model with length signal dot product length encodes weight intentionally

Rule of thumb. Default to cosine. Switch only when the embedding model's documentation explicitly recommends another metric. Mixing metrics across producer and store (encoder uses cosine, store uses L2 on unnormalised) is a silent recall killer.

Vector database interview question on choosing a similarity metric

A senior interviewer often opens with: "Your team is building a documentation search box. The embedding model returns 1536-dimensional vectors and the docs say 'normalised, use cosine'. Walk me through what changes if we accidentally store them in a database that defaults to dot product on unnormalised vectors."

Solution Using a metric-aware pre-normalisation check

# 1) At ingestion: enforce normalised vectors and pick the right metric.
import numpy as np
import pgvector.psycopg  # pgvector example; same idea for Pinecone / Qdrant / Weaviate

def normalise(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v if n == 0 else v / n

def ingest(rows, conn):
    with conn.cursor() as cur:
        for doc_id, raw_vec in rows:
            v = normalise(raw_vec)
            cur.execute(
                "INSERT INTO docs (doc_id, embedding) VALUES (%s, %s)",
                (doc_id, v.tolist()),
            )
    conn.commit()

# 2) At query time: use the cosine operator (<=>) in pgvector.
SELECT doc_id, 1 - (embedding <=> %s) AS score
FROM docs
ORDER BY embedding <=> %s
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Effect
ingest normalise raw vector → store as unit vector ‖v‖ = 1.0 for every row
index build pgvector HNSW with vector_cosine_ops tree built using cosine distance
query embedding <=> query_vec (cosine distance) returns 1 - cos(θ); lower is better
score 1 - (embedding <=> query_vec) converts back to similarity (higher is better)
order ORDER BY embedding <=> query_vec ascending distance = nearest first

The trace shows the two invariants you must preserve: every vector at ingestion is unit length, and every query path uses the same metric the index was built with.

Output:

Step Result
Normalisation every vector has L2 norm 1.0 ± 1e-6
Operator pgvector cosine operator <=> returns cosine distance
Top-10 latency (1M docs, HNSW) ~3–8 ms
Recall vs brute force ≥ 98%

Why this works — concept by concept:

  • Normalisation at ingestion — collapses the cosine-vs-dot decision into one. Once every stored vector has unit length, cosine and dot product return the same ranking, and the query path stays simple.
  • Operator-class alignment — pgvector lets you choose vector_cosine_ops, vector_l2_ops, or vector_ip_ops (inner product) when creating the HNSW index. The chosen class must match the query operator (<=>, <->, <#>) or recall silently collapses.
  • Score conversion<=> returns cosine distance (1 - cos), not cosine similarity. Most product code wants similarity; one subtract converts. Display-layer bugs ("the top result has score 0") usually trace to this mismatch.
  • One-metric contract — write down which metric the embedding model uses, which operator class the index uses, and which operator the query uses. All three must agree. This three-line invariant is what keeps the recall regression test green across migrations.
  • Cost — O(1) ingest overhead per row (one norm + divide); O(query) is unchanged because the HNSW index already used cosine at build time. Pure semantic insurance, zero runtime tax.

SQL
Topic — database
Database design problems (SQL)

Practice →


2. The vector DB role in a data platform

The vector database is one stage in a retrieval system — producers feed it, hybrid retrieval and rerankers sit above it

The mental model in one line: a vector database is the dense-retrieval middle layer between an embedding producer (batch / stream / backfill) and a higher-level retrieval composition that blends vector similarity, BM25 keyword scoring, metadata filters, and a reranker. Once you draw that flow on the whiteboard, the "do I need Pinecone or pgvector?" question reframes into "what is my producer throughput and my retrieval composition?" — which is what a senior interviewer actually wants to hear.

Architecture role diagram of a vector database in a platform — left side shows an embedding producer card, middle is a tall 'vector DB' rounded card containing index, sharding, and replication chips, right side shows a hybrid retrieval card branching into a BM25 keyword icon and a metadata filter icon feeding a reranker card above, on a light PipeCode card.

The producer side.

  • Batch embedder. Reads a source table or document store, calls the encoder model in batches of 32–512, writes vectors back into the store. Used for nightly backfills and the initial population.
  • Streaming embedder. Subscribes to a Kafka / Pub/Sub topic of new documents, embeds and upserts in seconds. Used for "freshness matters" use cases — news search, ticket dedupe, product catalogues.
  • Backfill harness. A separate code path that re-embeds the entire corpus when the model is upgraded. Always plan for this — every embedding-model upgrade is a full re-embed.

The store side — index, shard, replica.

  • Index. The data structure (HNSW, IVFFlat, DiskANN) that turns "top-K nearest" into a sub-linear operation. Lives in memory (HNSW, IVFFlat) or on disk (DiskANN, on-disk Qdrant).
  • Shard. Horizontal partitioning by vector_id hash or by tenant. Each shard owns a slice of the corpus; queries fan out to all shards and merge.
  • Replica. Read-replicas of each shard. Doubles or triples read throughput; the write path stays on the primary.

Index types per use case.

  • HNSW — default for low-latency online search. p99 latency 5–25 ms at 10M vectors. Memory-heavy.
  • IVFFlat — cheaper memory, slower writes (requires a training step on a sample). Good fit for "cheap shelf for medium recall."
  • DiskANN — built for 100M+ vectors on commodity SSDs. Higher tail latency, much cheaper RAM footprint.
  • Quantization layers — scalar (int8) and product quantization (PQ) sit on top of any of the above to compress vectors 4–32×.

Hybrid retrieval — vector + keyword + metadata.

  • Pure vector retrieval underperforms BM25 on exact-match queries ("error code 500"); pure BM25 underperforms vector on semantic queries ("login keeps failing").
  • Hybrid = both — score-fuse the two rankings (reciprocal rank fusion is the cheap default) or use a reranker on the union.
  • Metadata filter — pre-filter (push the predicate down into the index) or post-filter (run the ANN, then filter the results). Pre-filter is faster but harder to implement; post-filter is naïve and silently drops recall.

Online vs offline collections.

  • Online collection — serves the live application traffic. Must not be reindexed in place; updates are upserts.
  • Offline collection — sandbox for experiments (new embedding model, different chunking, larger ef_search). Promoted via a blue / green collection swap at the application layer.

Cache and reranker layers above the vector store.

  • Cache. Memoise frequent queries — (query_text → top-K result) in Redis with a 5-minute TTL. Cuts vector traffic 30–60 percent on repeat-heavy workloads.
  • Reranker. Pull top-50 from the ANN, then rerank with a cross-encoder model that scores the (query, document) pair more carefully. Adds 30–80 ms but lifts precision by 5–15 points.

Common interview probes on the role of the vector DB.

  • "What sits between the encoder and the vector DB?" — a batch embedder for backfills, a streaming embedder for live ingestion, a backfill harness for model upgrades. Three code paths, not one.
  • "What sits above the vector DB?" — a hybrid-retrieval composition (vector + BM25 + filter) and optionally a reranker on the top-50 union.
  • "How do you scale reads?" — read replicas behind the shard primary. Sharding gives capacity; replication gives QPS.
  • "What is a blue / green collection swap?" — keep two collections (v1, v2) populated in parallel during a model upgrade, then point the application at v2 atomically when ready.

Worked example — retrieval-augmented generation pipeline shape

Detailed explanation. A retrieval-augmented generation (RAG) stack rarely lives in a single service. Most teams underestimate the number of services it touches — five is typical, eight is normal in production. Drawing the shape forces the trade-off conversation onto firm ground.

Question. Sketch the data flow for a documentation-search RAG service: from user query to LLM answer. Mark every place where latency or cost is paid, and identify the two services where a cache helps most.

Input.

Stage Service Typical latency Notes
1 Query embedding encoder (50–200 ms) one model call
2 ANN retrieval vector DB (5–25 ms) top-50 candidates
3 Metadata filter vector DB (1–5 ms) tenant + lang + date
4 Reranker cross-encoder (30–80 ms) top-50 → top-10
5 LLM call LLM API (300–1500 ms) top-10 as context

Code.

# Simplified RAG flow with two cache layers.
def rag_answer(query: str, tenant_id: int, lang: str) -> str:
    cache_key = (query, tenant_id, lang)

    # Cache 1 — full-answer cache (cuts LLM cost on repeat queries)
    if hit := answer_cache.get(cache_key):
        return hit

    # Cache 2 — retrieval cache (cuts encoder + vector DB cost)
    if not (top_k := retrieval_cache.get(cache_key)):
        q_vec = encoder.embed(query)                              # 50-200 ms
        candidates = vector_db.search(                            # 5-25 ms
            q_vec,
            top_k=50,
            filter={"tenant_id": tenant_id, "lang": lang},
        )
        top_k = reranker.rerank(query, candidates, top_k=10)      # 30-80 ms
        retrieval_cache.set(cache_key, top_k, ttl=300)

    prompt = build_prompt(query, top_k)
    answer = llm.complete(prompt)                                  # 300-1500 ms
    answer_cache.set(cache_key, answer, ttl=300)
    return answer
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The user query enters as plain text. Stage 1 — query embedding — calls the encoder model and produces a 1536-d vector. This is a hard cost on every uncached query.
  2. Stage 2 — ANN retrieval — calls the vector database with the query vector. The filter tenant_id = ? AND lang = ? is pushed down so the ANN only inspects rows that match. Top-50 candidates come back in milliseconds.
  3. Stage 3 (logically — actually part of stage 2 on most vendors) is the metadata filter pushdown; in vendors like Pinecone and Qdrant it is woven into the ANN itself; in pgvector with the wrong query plan it can become a post-filter step that silently drops recall.
  4. Stage 4 — reranker — pulls a cross-encoder model that scores (query, candidate) pairs more carefully than dense similarity. The top-50 are reranked to top-10. This costs 30–80 ms but lifts precision visibly.
  5. Stage 5 — LLM call — sends the top-10 as context to the generative model. Dominates the end-to-end latency budget. Two cache layers (answer cache, retrieval cache) cover both the repeat-query case and the repeat-retrieval-different-prompt case.

Output.

Path End-to-end p50 End-to-end p99
Full cache miss 400–1800 ms 1500–3000 ms
Retrieval cache hit, answer miss 320–1500 ms 1200–2500 ms
Answer cache hit 1–5 ms 10–30 ms

Rule of thumb. The LLM call is the dominant cost; everything else is a rounding error. Two cache layers (retrieval cache for top-K, answer cache for full responses) usually pay for themselves within a week of traffic.

Worked example — pre-filter vs post-filter recall trap

Detailed explanation. A naïve implementation runs the ANN first to get top-K, then applies a metadata filter to the results. If the filter is selective (rare value), the post-filter throws away most candidates and the user sees fewer than K results — a silent recall regression that does not show up in latency dashboards.

Question. Given a 10M-vector corpus where only 1 percent of rows have lang = 'sv' (Swedish), explain what happens with naive post-filter retrieval. Show the fix with pre-filtering.

Input.

Strategy Top-K initial Candidates after filter User sees
Post-filter, K=10 10 0–1 (statistically) near-empty result
Post-filter, K=1000 1000 ~10 OK but slow
Pre-filter, K=10 10 (already filtered) 10 correct

Code.

# WRONG — post-filter
def search_post_filter(query_vec, lang):
    candidates = vector_db.knn_query(query_vec, top_k=10)    # global top-10
    return [c for c in candidates if c.lang == lang]         # often 0

# BETTER — over-fetch and post-filter
def search_overfetch(query_vec, lang):
    candidates = vector_db.knn_query(query_vec, top_k=1000)  # 100x
    filtered = [c for c in candidates if c.lang == lang]
    return filtered[:10]

# CORRECT — pre-filter (filter pushdown into the ANN)
def search_pre_filter(query_vec, lang):
    return vector_db.knn_query(
        query_vec,
        top_k=10,
        filter={"lang": lang},   # vendor pushes this into the index
    )
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The naive post-filter asks the vector DB for the global top-10 across the entire 10M-vector corpus. Statistically, only 1 percent of those will have lang = 'sv' — most often zero.
  2. The over-fetch fix asks for top-1000 instead. Now ~10 will pass the filter on average. The trade-off: 100× more candidates to fetch and re-rank, with corresponding latency cost.
  3. The correct fix is pre-filtering — pushing the predicate into the ANN itself. Pinecone exposes this via the filter argument; Qdrant via must conditions; Weaviate via where; pgvector via a WHERE lang = 'sv' clause that the planner pushes below the index.
  4. With pre-filtering, the ANN only ever inspects vectors that match the filter. Recall is preserved, latency stays flat, and the user always sees K results when K candidates exist.

Output.

Strategy Recall when K=10 Latency Correctness
Naive post-filter ~5% (often 0 results) low broken
Over-fetch K=1000 then filter ~80% 100× higher mediocre
Pre-filter (pushdown) 95%+ normal correct

Rule of thumb. Always check whether your vendor pushes metadata filters into the ANN index versus applying them post-search. The difference shows up at moderate selectivity and gets catastrophic at high selectivity. Selectivity that looks "small" (1 percent) is exactly where the post-filter strategy breaks.

Worked example — sharding and replication topology

Detailed explanation. Sharding gives you capacity (more total vectors); replication gives you read throughput (more concurrent queries). They are orthogonal. A common mistake is conflating them — "we sharded the index" sometimes means "we added a read replica," which solves a different problem.

Question. Given a 200M-vector workload at 4,000 queries per second p99, design a topology. Assume each replica can hold 50M vectors in memory and handle 1,000 QPS at the recall SLO.

Input.

Constraint Value
Total vectors 200,000,000
Peak QPS 4,000
Replica memory budget 50M vectors
Replica QPS budget 1,000

Code.

# Topology calculation.
total_vectors = 200_000_000
peak_qps = 4_000

vectors_per_replica = 50_000_000
qps_per_replica = 1_000

shards = total_vectors // vectors_per_replica          # 4 shards
replicas_per_shard = -(-peak_qps // qps_per_replica)   # 4 replicas
                                                       # serving every shard
nodes = shards * replicas_per_shard                    # 16 nodes
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The 200M-vector total exceeds a single replica's 50M-vector memory budget by 4×. We need 4 shards to fit the corpus, each owning ~50M vectors.
  2. The 4,000 QPS peak exceeds a single replica's 1,000-QPS throughput by 4×. We need 4 replicas per shard to serve the load — every shard must answer every query.
  3. Total node count: 4 shards × 4 replicas = 16 nodes. The application's query router fans out to one replica per shard (4 fan-outs per query) and merges the top-K from each.
  4. The fan-out merge step costs O(shards · K) plus a final sort. At K=10 and 4 shards, this is 40 candidates merged — trivial cost compared to the underlying ANN call.
  5. If the workload grows to 500M vectors, add shards (10 total). If QPS grows to 10,000, add replicas per shard (10 replicas × 10 shards = 100 nodes). The two axes scale independently.

Output.

Resource Value
Shards (capacity) 4
Replicas per shard (QPS) 4
Total nodes 16
Fan-out per query 4 (one per shard)
Merge cost O(40) at K=10

Rule of thumb. Sharding is for capacity. Replication is for throughput. Compute each independently from the workload, then multiply. Conflating the two is the most common topology bug — and the one that makes "we need a bigger vector DB" sound louder than the actual workload demands.

Vector database interview question on platform topology

A senior interviewer often frames this as: "Walk me through the data flow for a documentation search service from raw doc to user answer. Where would you add a cache, where would you shard, and what changes when the embedding model is upgraded?"

Solution Using a layered RAG topology with explicit cache and blue / green collection

# 1) Producer — batch + streaming embedders feeding one online collection.
class EmbeddingProducer:
    def upsert_batch(self, docs):
        vecs = self.encoder.embed_batch([d.text for d in docs])
        self.vector_db.upsert(self.collection, list(zip(docs, vecs)))

# 2) Retrieval — pre-filter ANN with cache.
class Retriever:
    def search(self, query, tenant_id, lang, k=10):
        key = (query, tenant_id, lang, k)
        if hit := self.retrieval_cache.get(key):
            return hit
        q_vec = self.encoder.embed(query)
        candidates = self.vector_db.search(
            self.collection,
            q_vec,
            top_k=50,
            filter={"tenant_id": tenant_id, "lang": lang},
        )
        top_k = self.reranker.rerank(query, candidates, top_k=k)
        self.retrieval_cache.set(key, top_k, ttl=300)
        return top_k

# 3) Model upgrade — blue / green collection swap.
class ModelUpgrade:
    def run(self, new_model):
        green = self.vector_db.create_collection("docs_v2", dim=new_model.dim)
        self.backfill(new_model, green)               # re-embed full corpus
        self.dual_write(green)                        # mirror live writes
        self.compare_metrics(blue="docs_v1", green="docs_v2")
        self.application.point_at("docs_v2")          # atomic switch
        self.vector_db.drop_collection("docs_v1")     # tombstone after soak
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Service touched Latency contribution Cost contribution
1 — query in application 0 0
2 — answer cache check Redis 1 ms $0 if hit
3 — retrieval cache check Redis 1 ms $0 if hit
4 — encoder call encoder API 50–200 ms $0.0001 / call
5 — vector DB pre-filter search vector DB 5–25 ms replica RAM
6 — reranker cross-encoder 30–80 ms GPU seconds
7 — LLM call LLM API 300–1500 ms $0.001 / call

The trace identifies the two cache wins: the retrieval cache covers stage 4 + 5 + 6 (the dense half of the pipeline) and the answer cache short-circuits everything.

Output:

Topology axis Value
Shards 4 (50M vectors each)
Replicas per shard 4 (1,000 QPS each)
Online collection docs_v1 (current model)
Offline collection docs_v2 (new model, populating)
Cache layers retrieval + answer (Redis)

Why this works — concept by concept:

  • Pre-filter pushdown — the metadata filter is woven into the ANN call, so recall stays at 95%+ even when the filter selectivity is 1 percent. Post-filter strategies silently drop the result count below K.
  • Retrieval cache — memoises (query, tenant, lang) → top-K for 5 minutes. Cuts encoder + vector DB cost 30–60 percent on repeat-heavy traffic without affecting freshness in any meaningful way.
  • Reranker on top-50 — the dense ANN gives a generous recall pool; the cross-encoder rerank pushes precision up by 5–15 points. The cost is one extra forward pass on 50 pairs — modest at GPU prices.
  • Blue / green collection swap — the embedding-model upgrade is the riskiest day-2 operation. Building a sibling collection lets you keep the old one serving while the new one fills, then atomically flip the application pointer.
  • Cost — encoder + LLM dominate variable cost; vector DB dominates fixed cost (replica RAM). Cache amortises both. Total p99 SLO is bounded by LLM latency.

SQL
Topic — design
System design problems (SQL)

Practice →


3. Pinecone vs Weaviate vs Qdrant vs pgvector — vendor comparison

The four-vendor matrix is really a two-axis pick: how managed do you want it, and what scale do you have today

The mental model in one line: pgvector is the "reuse Postgres" first choice up to ~10M vectors; pinecone is the fully managed serverless option; weaviate is the modular OSS pick when you need RAG modules and a GraphQL API; qdrant is the low-latency Rust option with strong payload filtering. Once you can recite that, the "which one should we use?" interview question becomes a fifteen-second answer instead of a fifteen-minute survey.

Four-column vendor comparison card — Pinecone (purple), Weaviate (green), Qdrant (orange), pgvector (blue) each shown as a tall rounded card with a header strip, a tagline, and four feature badges (hosting, index types, scale sweet spot, key strength), on a light PipeCode card.

The vendor matrix in one table.

Vendor Hosting Index types Scale sweet spot Key strength Key trade-off
Pinecone Fully managed (serverless + dedicated) HNSW (proprietary) 10M–10B vectors zero-ops, multi-tenant, serverless closed source; lock-in
Weaviate OSS + managed HNSW, IVF, flat 1M–500M vectors RAG modules, GraphQL API, BYO encoder heavier surface area
Qdrant OSS + managed HNSW (in-memory or on-disk) 1M–1B vectors strong payload filter, Rust performance smaller ecosystem
pgvector Postgres extension HNSW, IVFFlat up to ~10M (single replica) reuse Postgres, SQL joins scale ceiling

Pinecone in detail.

  • Hosting model. Fully managed serverless or dedicated. No nodes to operate, no index to rebuild — the vendor handles everything.
  • API surface. REST + gRPC. Concepts: index, namespace (for multi-tenant), upsert, query, filter.
  • Sweet spot. Teams that want to ship retrieval without an ops team. Workloads from 1M to 10B vectors.
  • Trade-off. Closed source; vendor lock-in; per-vector pricing that compounds at large scale. Compliance teams sometimes block it (no on-prem mode).

Weaviate in detail.

  • Hosting model. Open-source binary, self-hosted, or managed cloud.
  • API surface. GraphQL is the canonical API; REST also supported. Concepts: class (schema), object, vectorizer (built-in encoder modules).
  • Sweet spot. Teams that want OSS with batteries included — built-in RAG modules (text2vec-openai, text2vec-cohere), generative search, hybrid search out of the box.
  • Trade-off. Heavier surface area than the others — the schema, the modules, the GraphQL layer all add concepts. Steeper learning curve.

Qdrant in detail.

  • Hosting model. Open-source binary in Rust, self-hosted, or managed cloud.
  • API surface. REST + gRPC. Concepts: collection, point, payload (metadata), filter (the strongest payload filter language of any vendor).
  • Sweet spot. Low-latency online search with heavy filtering. Rust performance shows up on the tail latency curve.
  • Trade-off. Smaller ecosystem than Pinecone / Weaviate. Fewer prebuilt integrations.

pgvector in detail.

  • Hosting model. A Postgres extension — runs anywhere Postgres runs (RDS, Aurora, Cloud SQL, on-prem).
  • API surface. Native SQL. Indexes: HNSW (USING hnsw) and IVFFlat (USING ivfflat).
  • Sweet spot. Teams that already operate Postgres and need vectors as one more column. Up to ~10M vectors on a single replica before reindex windows and memory budgets become painful.
  • Trade-off. Scale ceiling. Above ~10–50M vectors on commodity hardware, a dedicated vector DB pulls ahead on both latency and operations.

Latency, throughput, recall — order-of-magnitude landmarks.

Scale Vendor Typical p99 latency Typical replica memory
1M vectors any of the four 2–10 ms 4–6 GB
10M vectors Pinecone / Weaviate / Qdrant 5–25 ms 30–60 GB
10M vectors pgvector HNSW 10–60 ms 30–60 GB
100M vectors Pinecone / Qdrant DiskANN 10–50 ms 50–200 GB SSD
100M vectors pgvector unsupported in practice
1B+ vectors Pinecone (sharded) / DiskANN OSS 20–100 ms TB of SSD

These are not benchmarks — they are landmarks for "what scale fits which vendor without heroic engineering." Real benchmarks must always be run on your data and your queries.

When pgvector is enough.

  • Corpus under 10M vectors, single tenant or a small fixed number of tenants.
  • Throughput under 200 QPS sustained.
  • Heavy joins to other Postgres tables (orders, users, products) where the locality of "vector + structured fact" matters more than maximum throughput.
  • Ops team that already runs Postgres at scale — no new infrastructure to take on.

When you need a dedicated vector DB.

  • Corpus above ~50M vectors, or growing fast enough that you will cross 50M within the year.
  • Multi-tenant SaaS with hundreds or thousands of namespaces.
  • Strict p99 latency SLO under 25 ms at high QPS.
  • Embedding model upgrades are routine (more than once a quarter) — managed vendors give you the blue / green collection primitive cheaply.

Worked example — picking a vendor by scale and operational profile

Detailed explanation. Pretend you are sitting in front of a whiteboard with a product manager who has just announced "we are doing RAG." Most teams ship the wrong vendor on day one because they pick by brand awareness rather than by scale and operations. A two-question filter usually gets the right answer in 60 seconds.

Question. Given a small RAG product (target launch — 1 month) with 500K documents and a single tenant, which vendor should the team start with? When does it stop being the right choice?

Input.

Dimension Value
Initial corpus 500K documents → ~2M chunks
Tenants 1 (single-tenant)
Expected QPS 50
Existing infra Postgres (RDS)
Ops team size 3 engineers, no dedicated DB ops

Code.

-- pgvector setup — five lines, no new service.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    chunk_id    BIGSERIAL PRIMARY KEY,
    doc_id      BIGINT       NOT NULL,
    tenant_id   INT          NOT NULL,
    lang        TEXT         NOT NULL,
    text        TEXT         NOT NULL,
    embedding   VECTOR(1536) NOT NULL
);

CREATE INDEX doc_chunks_hnsw
ON doc_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Online query with metadata filter.
SELECT chunk_id, text, 1 - (embedding <=> $1) AS score
FROM doc_chunks
WHERE tenant_id = $2 AND lang = $3
ORDER BY embedding <=> $1
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The corpus is 2M vectors. Well below the 10M scale ceiling — pgvector fits comfortably on a single Postgres replica.
  2. The team already operates Postgres. Adding pgvector is CREATE EXTENSION vector + an index — no new service, no new on-call rotation, no new pricing dimension.
  3. The expected throughput is 50 QPS. A single replica with HNSW serves this with p99 under 20 ms — plenty of headroom for the first year of growth.
  4. The metadata filter (tenant_id, lang) is a standard B-tree WHERE pushdown. Postgres's query planner generally handles this well, but the team should verify with EXPLAIN ANALYZE that the HNSW index plus the filter compose without falling back to a sequential scan.
  5. The migration trigger to a dedicated vector DB: when the corpus crosses ~10M vectors or when QPS sustained crosses ~200 or when the team starts running multi-tenant with hundreds of namespaces. Until any of those flip, pgvector is the right pick.

Output.

Decision Choice Why
Day-one pick pgvector reuse Postgres; minimal ops surface
Migration trigger 1 corpus > 10M vectors reindex windows become painful
Migration trigger 2 sustained QPS > 200 single-replica throughput ceiling
Migration trigger 3 many tenants (100+) namespace primitives in dedicated vendors

Rule of thumb. Start with pgvector if you already run Postgres. Add a dedicated vector DB when you cross one of the three migration triggers above — not before, not because of brand. The wrong vendor in week one is recoverable; the wrong vendor with 50M vectors loaded is not.

Worked example — Pinecone serverless vs Qdrant self-hosted on the same workload

Detailed explanation. A common interview ask is "compare two specific vendors." The trap is that comparing on features alone misses the operational delta. Two vendors with similar feature surfaces can have a 10× difference in ops cost.

Question. Given a 30M-vector workload at 1,500 QPS with a 25 ms p99 SLO, compare Pinecone serverless and Qdrant self-hosted on three axes: time-to-first-query, monthly cost order of magnitude, and the operational events the team must handle.

Input.

Vendor Hosting model Pricing axis Team ops responsibility
Pinecone serverless fully managed per-vector + per-query none
Qdrant self-hosted OSS on EC2 / EKS infrastructure only node lifecycle, upgrades, backups

Code.

# Pinecone serverless — minutes to first query, no infra to provision.
pinecone create-index docs-prod \
  --dimension 1536 \
  --metric cosine \
  --cloud aws --region us-east-1

# Qdrant self-hosted — Helm chart on EKS plus node group capacity planning.
helm install qdrant qdrant/qdrant \
  --set persistence.size=300Gi \
  --set replicaCount=4
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Time-to-first-query. Pinecone serverless: create-index plus credentials, you are upserting in minutes. Qdrant self-hosted: provision EKS, configure the Helm chart, size the node group, set up TLS and IAM — typically a week of work the first time.
  2. Monthly cost. Pinecone serverless charges per stored vector and per query — predictable, scales linearly. Qdrant self-hosted pays only EC2 + EBS — flat curve once the cluster is up. Crossover point is workload-dependent, typically around 50–100M vectors for Pinecone to become more expensive than self-hosted.
  3. Operational events with Pinecone: none — vendor handles upgrades, replication, failover. Operational events with Qdrant self-hosted: node upgrades, snapshot backups, capacity scaling, on-call rotation. Different team-size implication.
  4. Migration cost. Both expose REST + gRPC; switching is painful but not impossible. The data export side is well-supported on both. The application side is one client library swap.
  5. The right answer. "Pinecone serverless until either cost crosses our self-hosted floor or compliance requires on-prem. Then Qdrant self-hosted." The decision is a function of team size, compliance, and projected scale — not a feature checklist.

Output.

Axis Pinecone serverless Qdrant self-hosted
Time-to-first-query minutes days to a week
Monthly cost at 30M / 1500 QPS predictable, mid-tier flat once provisioned
Ops events / month 0 1–4 (upgrades, capacity, backups)
On-prem option no yes
Cost crossover ~50–100M vectors

Rule of thumb. When team size is small and compliance permits, managed services trade money for time and they almost always win the first two years. When team size is comfortable and compliance requires on-prem, self-hosted Qdrant or Weaviate trades time for money — both legitimate trade-offs.

Worked example — Weaviate's hybrid search out of the box

Detailed explanation. Weaviate is the only vendor of the four with first-class hybrid search built in — the team supplies alpha, a single knob that blends BM25 and vector scoring. It is the smallest amount of code for a hybrid retrieval baseline.

Question. Given a knowledge base loaded into Weaviate with the text2vec-openai module, write a hybrid search query that blends keyword and vector scoring with alpha = 0.5.

Input.

User query Pure vector top-1 Pure BM25 top-1 Hybrid top-1 (alpha=0.5)
"error code 500" "the server failed" "error code 500 troubleshooting" "error code 500 troubleshooting"
"why won't my user log in" "login flow troubleshooting" "log file format" "login flow troubleshooting"

Code.

# Weaviate hybrid search — alpha blends vector and BM25.
import weaviate

client = weaviate.Client("https://weaviate-cluster.example.net")

result = (
    client.query
    .get("Document", ["title", "snippet"])
    .with_hybrid(
        query="error code 500",
        alpha=0.5,                        # 0=BM25 only, 1=vector only
    )
    .with_limit(10)
    .with_where({
        "path": ["tenant_id"],
        "operator": "Equal",
        "valueInt": 7,
    })
    .do()
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Weaviate's with_hybrid operator runs the vector ANN and the BM25 keyword search in parallel, then fuses the two scores with the alpha-weighted formula score = alpha · vector_score + (1 - alpha) · bm25_score.
  2. alpha = 0 is keyword-only (classical BM25). alpha = 1 is vector-only (pure ANN). alpha = 0.5 is an equal blend.
  3. For the "error code 500" query, BM25 hits the exact phrase in the title — high keyword score. The vector might match semantically similar but less exact docs. Hybrid surfaces the exact-match doc on top because BM25's contribution breaks the tie.
  4. For the "why won't my user log in" query, BM25 hits docs containing "log" and "log file" — semantically wrong. Vector hits the login-flow doc semantically. Hybrid keeps the semantically right answer at the top because the vector signal is strong enough.
  5. The with_where predicate is pushed into the ANN as a pre-filter — multi-tenant isolation is enforced before the dense scoring.

Output.

Query alpha Top-1
"error code 500" 0.5 "Error Code 500 troubleshooting"
"why won't my user log in" 0.5 "Login flow troubleshooting"
"what is a vector database" 0.5 "Introduction to vector databases"

Rule of thumb. Hybrid search with alpha ≈ 0.5 outperforms either pure mode on broad evaluation sets — typically lifting top-1 precision by 5–15 points. Sweep alpha on your own eval set; the optimum is rarely exactly 0.5, often 0.3–0.7.

Vector database interview question on the right pgvector vs Pinecone trade-off

A senior interviewer often opens with: "We have 8M document chunks in Postgres today and we are exploring RAG. The team is two engineers. Should we stay on pgvector or move to Pinecone?" The interviewer wants to hear the reasoning, not the brand.

Solution Using a scale + ops + compliance decision matrix

# Decision criteria — three filters, each with a sharp threshold.

Filter 1 — Scale today and projected 12 months out:
  - Corpus under 10M vectors AND projected under 30M in 12 months → pgvector OK
  - Otherwise → dedicated vector DB

Filter 2 — Ops team capacity:
  - Already running Postgres at production scale → pgvector cheap
  - No DB ops experience → managed (Pinecone) trades money for time

Filter 3 — Compliance / data residency:
  - SaaS or US/EU general workload → either is fine
  - On-prem mandate or strict residency → self-hosted (Qdrant / Weaviate / pgvector)

Decision for the asked case (8M today, 2 engineers, no on-prem mandate):
  - Filter 1: 8M < 10M → pass
  - Filter 2: 2 engineers, already on Postgres → pgvector wins
  - Filter 3: no mandate → pgvector wins
  - Choice: pgvector. Revisit in 6 months when corpus crosses 15M.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Criterion Threshold Asked case Verdict
Corpus size < 10M today, < 30M in 12 months 8M today, ~20M projected pass
Ops team experience already operates Postgres yes, 2 engineers pgvector cheap
Throughput SLO < 200 QPS sustained 80 QPS expected pass
Compliance no on-prem mandate no both options open
Decision pgvector

The trace shows the question is not "is pgvector good enough?" — it is "do any of three sharp thresholds force you off pgvector?" If no threshold flips, the cheapest option wins.

Output:

Decision Action
Start with pgvector on existing Postgres
First reindex window planned 12 months out (or earlier on triggers)
Migration plan trigger corpus > 15M vectors or QPS > 200
Migration plan candidate Qdrant self-hosted (cost-effective) or Pinecone serverless (zero-ops)

Why this works — concept by concept:

  • Scale threshold — pgvector's HNSW lives in shared buffers and OS page cache; performance falls off a cliff above ~10–50M vectors per replica on commodity hardware. Below that threshold, the ANN behaviour is identical to dedicated stores.
  • Ops leverage — adopting a new datastore costs an on-call rotation, monitoring instrumentation, and a learning curve. Postgres already has all three. Reusing it is the cheapest possible day-1 decision.
  • Compliance gate — managed serverless vendors fail certain on-prem and data-residency requirements. The compliance filter must run before the cost or feature comparison or you build a system you cannot ship.
  • Migration trigger discipline — write down the thresholds now, not later. When pgvector starts hurting, the team has a pre-agreed exit, not a new debate.
  • Cost — pgvector adds zero net new infrastructure; the Postgres replica was already paid for. Dedicated stores cost an extra service from day one. The pgvector path is strictly cheaper until a threshold flips.

SQL
Topic — design
Design problems (SQL)

Practice →


4. Index types — HNSW, IVFFlat, quantization, DiskANN

There is no universally best ANN index — only the right one for your workload size, latency SLO, and memory budget

The mental model in one line: hnsw is the default low-latency in-memory index; ivfflat is the cheaper-memory option with a training step; scalar / product quantization compresses vectors 4–32× with a few-point recall trade-off; DiskANN is the "100 million-plus vectors on commodity SSD" play. Once you can map a workload to one of those four buckets, the index choice — and the parameter tuning — falls out almost mechanically.

Four-quadrant index types diagram — top-left HNSW small-world graph illustration, top-right IVFFlat centroids with inverted lists, bottom-left scalar/product quantization shown as compressed grids, bottom-right DiskANN shown as a graph spanning across an SSD disk icon; each quadrant has a small parameter pill row, on a light PipeCode card.

HNSW (Hierarchical Navigable Small World) in detail.

  • Shape. A multi-layer graph where each node connects to a fixed number of neighbours per layer; the upper layers are sparser and act as "long-distance jumps."
  • Build parameters. M (out-degree per node, typical 12–48; default 16). ef_construction (candidate pool at build time, typical 100–400; default 200).
  • Query parameter. ef_search — the candidate pool at query time. Larger → higher recall and higher latency.
  • Strength. Lowest p99 latency of any in-memory index at moderate scale. The default in pgvector, Pinecone, Qdrant, Weaviate.
  • Weakness. Memory-heavy — every vector lives in RAM plus its graph edges. ~(4 · dim + 8 · M) · N bytes baseline.

IVFFlat (Inverted File with Flat quantizer) in detail.

  • Shape. Vectors are clustered into nlist centroids via k-means; each query inspects the closest nprobe clusters and brute-force-searches their members.
  • Build parameters. nlist (number of centroids, typical sqrt(N)). Requires a training step on a representative sample.
  • Query parameter. nprobe (clusters to inspect). Larger → higher recall and higher latency.
  • Strength. Smaller memory footprint than HNSW (only the centroids + inverted lists). Cheaper writes.
  • Weakness. Higher tail latency. The training step is offline and complicates incremental indexing.

Scalar quantization (SQ) in detail.

  • Idea. Store each float32 component as an int8 — 4× compression. Recall drops by 0.5–3 points; latency improves because the distance computation is now cheap integer arithmetic.
  • When to use. Always consider on top of HNSW or IVFFlat when memory is the binding constraint. The recall trade-off is usually acceptable for production retrieval.

Product quantization (PQ) in detail.

  • Idea. Split each vector into m sub-vectors, learn a codebook of 256 entries per sub-vector, store each sub-vector as a 1-byte codebook index. Typical compression 16–32×.
  • When to use. Massive scale (100M+) where SQ alone cannot fit the memory budget. Recall drops by 3–8 points; consider with a rerank-on-top of the original float32 vectors for the top-50.

DiskANN in detail.

  • Shape. Graph index similar to HNSW but explicitly designed to live on SSD. Queries pay sequential SSD reads instead of paying RAM cost.
  • When to use. 100M+ vectors where the RAM-only cost of HNSW exceeds the budget. Pinecone serverless uses a DiskANN-class technique under the hood at large scale; OSS DiskANN is the canonical reference.
  • Trade-off. Higher p99 latency than in-memory HNSW (typically 20–80 ms vs 5–25 ms), but order-of-magnitude cheaper RAM footprint.

Parameter intuition in one line each.

  • M — graph density. Higher M → better recall, more RAM.
  • ef_construction — build effort. Higher → better quality graph, slower index build.
  • ef_search — query effort. Higher → better recall, higher query latency.
  • nlist — IVFFlat coarse clusters. Rule of thumb: nlist ≈ sqrt(N).
  • nprobe — IVFFlat search effort. Higher → better recall, higher query latency.

Common interview probes on index choice.

  • "Why is HNSW the default?" — best latency-recall trade-off for in-memory workloads at moderate scale. The graph structure has logarithmic search complexity.
  • "When would you reach for IVFFlat over HNSW?" — when memory is the binding constraint and you can tolerate higher tail latency. Often paired with PQ.
  • "How does quantization affect recall?" — scalar quantization (int8) loses 0.5–3 points; product quantization loses 3–8 points. Usually worth it for the memory savings.
  • "What is DiskANN for?" — vectors at 100M+ scale on commodity SSDs; trades RAM for predictable disk I/O.

Worked example — picking HNSW vs IVFFlat for a 5M-vector workload

Detailed explanation. A team has 5M vectors of dim=1536 and a single replica with 32 GB RAM. They need to pick between HNSW and IVFFlat. The right answer depends on the recall floor, the latency ceiling, and the RAM budget — but the calculation can be done in two minutes.

Question. Estimate the memory footprint of HNSW with M=16 and IVFFlat with nlist=2000 on this workload. Pick the index given a 25 ms p99 SLO and a 95 percent recall floor.

Input.

Quantity Value
N (vectors) 5,000,000
dim 1536
RAM budget 32 GB
p99 SLO 25 ms
Recall floor 95%

Code.

N = 5_000_000
dim = 1_536

# HNSW memory: ~4 bytes per float component + ~8 bytes per graph edge per node.
M = 16
hnsw_vec_bytes = N * dim * 4              # ~30 GB just for vectors
hnsw_edge_bytes = N * M * 8 * 2           # bidirectional edges
hnsw_total_gb = (hnsw_vec_bytes + hnsw_edge_bytes) / (1024**3)

# IVFFlat memory: vectors stored once + small centroid table.
nlist = 2000
ivf_centroid_bytes = nlist * dim * 4
ivf_vec_bytes = N * dim * 4
ivf_total_gb = (ivf_centroid_bytes + ivf_vec_bytes) / (1024**3)

print(round(hnsw_total_gb, 1), round(ivf_total_gb, 1))
# HNSW ~ 30.5 GB; IVFFlat ~ 28.6 GB
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The vector payload itself — N · dim · 4 bytes for float32 — is 5M · 1536 · 4 = 30.7 GB. This is the same for either index because both need the raw vectors to compute distance.
  2. HNSW adds graph edges: N · M · 8 bytes for the forward edges, doubled for bidirectional. At M=16 and bidirectional, this is ~1.3 GB. HNSW total: ~32 GB — right at the RAM ceiling.
  3. IVFFlat adds the centroid table: nlist · dim · 4 = 2000 · 1536 · 4 = 12 MB. Negligible. IVFFlat total: ~30.7 GB — comfortably under the ceiling.
  4. The RAM picture says IVFFlat wins on memory by 1–2 GB. But the latency picture says HNSW typically beats IVFFlat by 2–3× on p99 at this scale. With a 25 ms SLO and the recall floor at 95 percent, HNSW with ef_search=64 lands at ~12 ms p99 / 95.5 percent recall; IVFFlat with nprobe=20 lands at ~28 ms p99 / 94 percent recall.
  5. The right pick: HNSW, but consider scalar quantization (int8) on top to cut the vector payload from 30.7 GB to ~7.7 GB. Now the replica has 24 GB of headroom and the latency stays in budget.

Output.

Index Memory (GB) p99 latency Recall@10 SLO met?
HNSW M=16 ~32 12 ms 95.5% yes (tight)
HNSW + scalar quantization ~9 10 ms 94% borderline
IVFFlat nlist=2000 ~31 28 ms 94% latency fails
IVFFlat + scalar quantization ~8 22 ms 92% recall fails

Rule of thumb. Default to HNSW for online retrieval workloads below 50M vectors per replica. Layer scalar quantization on top when the vector payload dominates the RAM budget. Reach for IVFFlat only when the workload genuinely tolerates higher tail latency.

Worked example — product quantization at 100M vectors

Detailed explanation. Product quantization (PQ) is the technique that makes 100M-vector workloads fit on commodity hardware. It splits each vector into m sub-vectors, each represented by a 1-byte codebook index. The compression ratio scales with m and the trade-off shows up in recall.

Question. Given 100M vectors of dim=1024 and PQ with m=64 sub-vectors, estimate the memory footprint, the per-vector compression ratio, and the recall delta vs raw float32.

Input.

Quantity Value
N (vectors) 100,000,000
dim 1024
PQ sub-vectors m 64 (so each sub-vector is dim=16)
Codebook entries per sub-vector 256 (1 byte index)

Code.

N = 100_000_000
dim = 1_024
m = 64

# Raw float32 footprint
raw_bytes = N * dim * 4

# PQ footprint: 1 byte per sub-vector per row.
pq_bytes = N * m

# Compression ratio
ratio = raw_bytes / pq_bytes

# Codebook stored once: m codebooks of 256 entries each of size (dim/m * 4) bytes.
codebook_bytes = m * 256 * (dim // m) * 4

print(raw_bytes / 1e9, pq_bytes / 1e9, ratio)
# Raw: 409.6 GB; PQ: 6.4 GB; ratio: ~64x
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Raw float32 storage is 100M · 1024 · 4 = 409 GB. Way past any single-machine RAM budget.
  2. PQ with m=64 stores 64 bytes per vector — one codebook index per sub-vector. Total: 100M · 64 = 6.4 GB. The compression ratio is 4096 / 64 = 64×.
  3. The codebook itself is small: 64 codebooks · 256 entries · 16 floats · 4 bytes = 256 KB. Negligible.
  4. Distance computation with PQ uses asymmetric distance: the query stays in float32, and you precompute the query's distance to every codebook entry in each sub-vector (256 distances per sub-vector, computed once per query). Then the distance from query to a stored vector is a sum of 64 table lookups.
  5. Recall drops by ~5–8 points vs raw float32 ANN. The standard fix is rerank-on-top: pull top-100 with PQ, fetch the original float32 vectors for those 100 rows (still small), and re-score. Recall recovers to within 0.5 points of raw.

Output.

Storage Footprint Recall@10 Distance compute cost
Raw float32 409 GB 100% (baseline) 4 KB per pair
PQ m=64 6.4 GB ~92–95% 64 lookups per pair
PQ + rerank top-100 6.5 GB ~99% PQ + 100 float32 dot products

Rule of thumb. Use PQ at 100M+ scale or when raw float32 storage exceeds the RAM budget by 5× or more. Always pair PQ with a rerank step on the original vectors for the top-50 or top-100 — it recovers most of the recall loss for negligible extra cost.

Worked example — DiskANN for 500M vectors on a single host

Detailed explanation. DiskANN was designed at Microsoft Research for the "billion-scale on a single commodity machine" problem. It builds a graph index similar to HNSW but explicitly tuned to live on SSD — sequential reads dominate, and the graph layout is chosen so that each query touches a small, predictable number of pages.

Question. Given 500M vectors of dim=768 and a budget of 1.5 TB SSD plus 64 GB RAM, describe how DiskANN handles the workload and estimate per-query SSD reads.

Input.

Quantity Value
N (vectors) 500,000,000
dim 768
RAM budget 64 GB
SSD budget 1.5 TB

Code.

N = 500_000_000
dim = 768

# DiskANN stores vectors + graph edges on SSD.
# In-RAM: only a navigation index (top of the graph) + PQ codes for quick filtering.
on_disk_bytes_per_vec = dim * 4 + 32 * 8  # vector + ~32 edges of 8 bytes each
on_disk_total_gb = N * on_disk_bytes_per_vec / (1024**3)

# Per-query SSD reads: typically ~50-150 page reads of 4 KB each.
pages_per_query = 100
ssd_bytes_per_query = pages_per_query * 4096

# At ~250 microseconds per random 4 KB SSD read (NVMe), 100 reads = 25 ms.
ssd_latency_us = 250
p99_ms = pages_per_query * ssd_latency_us / 1000

print(round(on_disk_total_gb, 1), p99_ms)
# On-disk ~ 1551 GB; p99 ~ 25 ms
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Raw float32 storage: 500M · 768 · 4 = 1.42 TB. Fits the 1.5 TB SSD budget with headroom.
  2. The DiskANN graph layout co-locates each node's neighbours on disk so a graph hop is one sequential 4 KB read. Typical per-query path length: 50–150 hops at 100M+ scale.
  3. RAM usage: only the upper-layer navigation index plus per-node PQ codes for quick coarse filtering. At 500M vectors and a 16:1 PQ ratio, the PQ codes take 500M · 48 = 24 GB. Fits inside the 64 GB RAM budget with room for OS cache.
  4. Per-query latency budget: ~100 NVMe page reads at 250 µs each = 25 ms p99. Add CPU time for distance computations: total p99 ~30–60 ms on commodity hardware.
  5. Compared to in-memory HNSW at this scale (which would need ~1.5 TB RAM — economically impossible on a single host), DiskANN trades a 5–10× latency penalty for a 20× cost reduction.

Output.

Index On-disk RAM p99 latency Cost order
Raw float32 HNSW ~1.5 TB 5–15 ms not buildable on commodity hardware
In-memory HNSW + PQ ~100 GB 8–20 ms high-tier hardware
DiskANN ~1.5 TB ~64 GB 25–60 ms commodity hardware

Rule of thumb. Reach for DiskANN (or a Pinecone serverless tier that is implemented similarly) when in-memory ANN would exceed your RAM budget by 5×. Below that threshold, in-memory HNSW with quantization is cheaper and faster.

Vector database interview question on index parameter tuning

A senior interviewer often opens with: "You have HNSW running with default parameters and your eval set shows recall@10 at 88 percent — you need 95 percent. How do you get there without rebuilding the index from scratch?"

Solution Using ef_search first, then build-time M only if needed

# 1) Cheap fix — raise ef_search at query time. No reindex.
client.set_ef(128)                            # was 64
results = client.knn_query(query_vec, k=10)

# 2) Sweep to find the smallest ef_search that crosses 95% recall.
for ef in [64, 96, 128, 160, 192, 256]:
    client.set_ef(ef)
    recall, p99_ms = benchmark(client, eval_set)
    print(ef, recall, p99_ms)

# 3) If even ef_search=256 falls short, the graph itself is under-built.
#    Rebuild with higher M (graph density) and higher ef_construction.
new_index = build_hnsw(
    vectors,
    M=32,                  # was 16 — denser graph
    ef_construction=400,   # was 200 — more candidates at insert time
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

ef_search Recall@10 p99 latency Action
64 (current) 88% 8 ms start
96 92% 11 ms not yet
128 95.5% 16 ms hit
160 96.5% 22 ms overshoot
256 97.5% 38 ms wasted

The trace shows the recall curve crossing the 95 percent floor between ef_search=96 and ef_search=128. Picking 128 leaves a small recall buffer without exceeding the latency SLO. The "rebuild with higher M" branch is only triggered if even ef_search=256 cannot cross the floor — that is the signal the graph itself is too sparse.

Output:

Parameter Old value New value
ef_search 64 128
Recall@10 88% 95.5%
p99 latency 8 ms 16 ms
Reindex required? no no
M (build-time) 16 16 (unchanged)

Why this works — concept by concept:

  • ef_search is a runtime knob — no reindex required, no downtime, no application change. The fastest fix for a recall miss is always to raise ef_search first.
  • M is a build-time knobM only changes when you rebuild the index. Reach for it when even maxing out ef_search cannot meet the recall floor, signalling the graph has too few neighbours per node.
  • Recall-latency Pareto curve — HNSW exposes a smooth curve: each step up in ef_search buys recall at the cost of latency. The sweet spot is the smallest ef_search that crosses the floor with margin.
  • Eval set discipline — every recall measurement must come from a frozen, representative query set with brute-force gold labels. Without it, "raise ef_search" becomes a superstition rather than an engineering knob.
  • Cost — raising ef_search is free at index time and pays linearly at query time. Doubling ef_search typically doubles per-query CPU but at most adds a few percent to recall — diminishing returns kick in fast.

SQL
Topic — indexing
Indexing problems (SQL)

Practice →


5. Ops, cost, and failure modes

Vector databases feel like Lego until you reindex 50 million rows — plan the day-2 ops up front

The mental model in one line: the day-1 cost of a vector database is memory and storage; the day-2 cost is reindex windows, multi-tenant isolation, embedding-model drift, and backup / replication semantics — and these decide whether the system survives its second year. Once you can name the four day-2 failure modes cold, the "we just need to pick a vendor and ship it" framing collapses into the real conversation: who owns the reindex on Friday at 3 AM?

Three-zone ops and cost card — left zone shows a memory-sizing card with a stacked-bar visual for 1M / 10M / 100M vectors, middle zone shows a multi-tenant namespaces card with three coloured partition swimlanes, right zone shows a drift / reindex card with a blue-green collection swap visual and an embedding-model upgrade ribbon, on a light PipeCode card.

Memory sizing — the only formula you need.

  • Baseline. bytes ≈ 4 · dim · N for float32 vectors. At dim=1536 and N=10M, that is ~60 GB just for the vectors.
  • Index overhead. HNSW adds ~8 · M · N bytes for the graph (typically 1–5 percent of the vector payload). IVFFlat adds the centroid table (negligible).
  • Quantization. Scalar quantization cuts the vector payload 4×; product quantization cuts it 16–32×.
  • Sample table. At dim=768 / dim=1536 / dim=3072, plan for the following per-million-vector RAM cost.
dim float32 RAM / 1M vectors int8 RAM / 1M vectors PQ (16×) RAM / 1M vectors
768 3.1 GB 0.77 GB 0.19 GB
1536 6.1 GB 1.5 GB 0.38 GB
3072 12.3 GB 3.1 GB 0.77 GB

Index build cost vs query cost.

  • Build cost is one-time per generation of the index. HNSW build is roughly O(N · ef_construction · log N) — at 10M vectors and ef_construction=200, expect 30–90 minutes on a single CPU.
  • Query cost is per-request. The whole point of the build is to make the query cheap. Optimise build for off-hours; optimise query for the SLO.
  • Reindex windows. Every embedding-model upgrade triggers a full re-embed and a full reindex. Plan for one a quarter at minimum.

Multi-tenancy — namespaces vs collections vs partition keys.

  • Pinecone. Namespaces inside an index. Cheap, isolated at query time, share an index resource pool. Default for SaaS-style tenancy.
  • Qdrant. Payload field tenant_id plus a filter. Or per-tenant collection if you need hard isolation.
  • Weaviate. Multi-tenant collections (a first-class concept in recent versions) or per-class isolation.
  • pgvector. Add tenant_id as a column with a B-tree index; combine with the HNSW for hybrid pushdown. Or use Postgres partitioning by tenant_id for stronger isolation at higher ops cost.

Metadata filter pushdown — pre-filter vs post-filter recall traps.

  • Pre-filter. The vendor pushes the predicate down into the ANN traversal — only candidates that match the filter are inspected. Recall stays at the index baseline.
  • Post-filter. The ANN returns the global top-K, then the filter is applied after. Selectivity below ~10 percent causes silent recall loss — the user sees fewer than K results.
  • Vendor differences. Pinecone, Qdrant, Weaviate all support proper pre-filter pushdown. pgvector's behaviour depends on the query plan — sometimes the planner does push the predicate below the HNSW index, sometimes it does not. Verify with EXPLAIN ANALYZE on every new query shape.

Drift — embedding model upgrades.

  • Trigger. New encoder model ships (e.g. OpenAI text-embedding-3-small → next generation). Old vectors and new query vectors live in different semantic spaces — recall collapses if mixed.
  • Mitigation. Full re-embed of the corpus plus a blue / green collection swap. Two collections run in parallel during the swap; the application points at one or the other atomically.
  • Cost. A full re-embed at 10M vectors and $0.02 per 1K tokens at typical chunk sizes is ~$100–$500. Cheaper than a recall regression.

Backup, snapshot, replication semantics.

  • Pinecone. Replication and backup are vendor-managed; no team responsibility.
  • Qdrant. Snapshot endpoint writes a consistent copy to disk; replication via Raft consensus in distributed mode.
  • Weaviate. Backup module to S3 / GCS; replication factor configurable per class.
  • pgvector. Standard Postgres backup (pg_basebackup, WAL archiving) — covers vectors automatically because they are just another column.

Common interview probes on day-2 ops.

  • "How would you handle an embedding-model upgrade?" — blue / green collection swap with dual-write during the transition. Compare metrics on a frozen eval set before atomically switching.
  • "What is the worst-case reindex window?" — full re-embed + full ANN rebuild at the target scale. At 50M vectors, plan for 6–24 hours on a single-host build; parallelise if shorter is required.
  • "How do you do multi-tenancy without exploding cost?" — namespaces (Pinecone) or filter-on-payload (Qdrant) share the index resource pool, so cost scales with total vectors, not tenant count.
  • "What is the silent failure mode of metadata filters?" — post-filter on selective predicates drops recall to single digits. Pre-filter pushdown is the only correct implementation.

Worked example — sizing a 25M-vector workload at dim=1536

Detailed explanation. A team has 25M document chunks at dim=1536 and is choosing between float32 HNSW and HNSW + scalar quantization on a 64 GB-RAM replica. The right answer is a 5-minute calculation, not a vendor benchmark.

Question. Compute the RAM footprint for both options and decide which fits the 64 GB budget with at least 30 percent headroom.

Input.

Quantity Value
N 25,000,000
dim 1536
RAM budget 64 GB
Headroom target ≥ 30%
HNSW M 16

Code.

N = 25_000_000
dim = 1_536
M = 16

# Float32 HNSW
fp32_vec_gb = N * dim * 4 / (1024**3)
hnsw_graph_gb = N * M * 8 * 2 / (1024**3)  # bidirectional edges
fp32_total_gb = fp32_vec_gb + hnsw_graph_gb

# Scalar quantization (int8) HNSW
int8_vec_gb = N * dim * 1 / (1024**3)
int8_total_gb = int8_vec_gb + hnsw_graph_gb

print(round(fp32_total_gb, 1), round(int8_total_gb, 1))
# fp32 ~ 149.0 GB (does not fit)
# int8 ~ 41.5 GB (fits with ~35% headroom)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Float32 vector payload: 25M · 1536 · 4 = 153 GB. Massively over budget.
  2. HNSW graph edges: 25M · 16 · 8 · 2 = 6.4 GB — modest compared to the vector payload.
  3. Float32 HNSW total: ~149 GB. Even before considering query memory or OS overhead, the float32 option requires three replicas to hold the corpus — tripling cost.
  4. Scalar quantization (int8) vectors: 25M · 1536 · 1 = 36 GB. Plus the same graph overhead: ~42 GB total.
  5. Headroom check: (64 - 42) / 64 = 34%. Just above the 30 percent target. Acceptable.

Output.

Option Total RAM Fits 64 GB? Headroom
Float32 HNSW ~149 GB no -134%
Int8 HNSW (scalar quantization) ~42 GB yes 34%
PQ 16× HNSW ~10 GB yes 84%

Rule of thumb. Default to scalar quantization when the float32 vector payload exceeds 50 percent of the RAM budget. Reach for product quantization when the float32 payload exceeds 5× the RAM budget or when total vectors exceed 100M.

Worked example — blue / green collection swap on a model upgrade

Detailed explanation. Upgrading the embedding model is the single riskiest day-2 operation. Mixing old and new vectors silently destroys recall — they live in different semantic spaces. The blue / green pattern keeps the old collection live while the new one fills, then atomically switches the application.

Question. Describe the steps to migrate from embedding model v1 (dim=768) to model v2 (dim=1536) for a 20M-vector corpus, with no downtime and no recall regression.

Input.

Stage Old (docs_v1) New (docs_v2)
dim 768 1536
Encoder model v1 model v2
Production traffic yes no (offline)
Vectors 20M 0 → 20M

Code.

# 1) Create the new collection.
vector_db.create_collection("docs_v2", dim=1536, metric="cosine")

# 2) Backfill — re-embed every doc with the new model.
for batch in chunks(all_docs, size=256):
    vecs = model_v2.embed_batch([d.text for d in batch])
    vector_db.upsert("docs_v2", list(zip(batch, vecs)))

# 3) Dual-write — every new doc gets embedded with both models.
def on_new_doc(doc):
    v1 = model_v1.embed(doc.text)
    v2 = model_v2.embed(doc.text)
    vector_db.upsert("docs_v1", [(doc, v1)])
    vector_db.upsert("docs_v2", [(doc, v2)])

# 4) Compare metrics on a frozen eval set.
metrics_v1 = evaluate("docs_v1", eval_set, model_v1)
metrics_v2 = evaluate("docs_v2", eval_set, model_v2)
assert metrics_v2.recall >= metrics_v1.recall - 0.005   # within 0.5 pts

# 5) Atomic application switch.
config.set("active_collection", "docs_v2")
config.set("active_encoder", "model_v2")

# 6) Soak period, then drop the old collection.
time.sleep(soak_window)
vector_db.drop_collection("docs_v1")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Stage 1 creates a sibling collection with the new dimension. The old collection keeps serving traffic — no user impact yet.
  2. Stage 2 backfills the new collection by re-embedding the entire corpus with the new model. This is the expensive step — for 20M chunks at $0.02 per 1K tokens, plan for $100–$500.
  3. Stage 3 dual-writes new docs into both collections so neither one falls behind during the backfill. This phase ends when both collections have the same row count.
  4. Stage 4 evaluates retrieval quality on a frozen eval set with brute-force gold labels. If recall@10 on the new collection is within 0.5 points of the old one (or better), the switch is safe.
  5. Stage 5 is the atomic switch — the application config changes the active collection pointer. Traffic immediately routes to docs_v2. The old collection is still warm; rollback is one config change away.
  6. Stage 6 is the soak — let the new collection serve traffic for a few days, monitor query patterns, then drop the old collection to reclaim the resources.

Output.

Phase Old collection New collection Application points at
1. create full empty old
2. backfill full filling old
3. dual-write full full old
4. eval pass full full old
5. switch full full new
6. drop old dropped full new

Rule of thumb. Always blue / green on model upgrades. Never reindex in place — the cost of a regression caught in production is higher than the cost of the extra collection. The pattern also works for index parameter changes (M, ef_construction) and metric changes (cosine ↔ dot product ↔ L2).

Worked example — multi-tenant namespace isolation in Pinecone

Detailed explanation. SaaS workloads run hundreds or thousands of tenants. A naive design uses one index per tenant — index-creation cost dominates, ops becomes a nightmare. The right pattern is one index plus a namespace per tenant; the namespace is a logical partition inside the shared index resource pool.

Question. A SaaS product has 5,000 customer tenants, each with 1,000 to 100,000 documents. Compare "one index per tenant" vs "one index plus a namespace per tenant" on cost and ops surface.

Input.

Strategy Indexes Namespaces Resource provisioning
One index per tenant 5,000 0 per-tenant capacity
One index plus namespace per tenant 1 5,000 shared capacity

Code.

# WRONG — one index per tenant (5000 indexes; vendor limits + cost)
for t in tenants:
    pinecone.create_index(f"docs-{t.id}", dimension=1536)

# CORRECT — one index, namespace per tenant
pinecone.create_index("docs", dimension=1536)

def upsert(tenant_id, doc, vec):
    pinecone.upsert(
        index_name="docs",
        namespace=f"tenant-{tenant_id}",
        vectors=[(doc.id, vec)],
    )

def query(tenant_id, query_vec, k=10):
    return pinecone.query(
        index_name="docs",
        namespace=f"tenant-{tenant_id}",
        vector=query_vec,
        top_k=k,
    )
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Per-index strategy. Each tenant gets its own index — 5,000 separate ANN structures. Most vendors limit indexes per project (typically 20–200). Even where unlimited, every index has a per-index resource floor (memory, control-plane overhead) that does not scale down to small tenants.
  2. Namespace strategy. One index holds all tenants; the namespace string is appended to every upsert and query. The ANN is shared; the namespace acts as a partition key that the query engine filters on before scoring.
  3. Cost. Per-index strategy pays a fixed cost per tenant; total cost scales with tenants × per-index floor. Namespace strategy pays one fixed cost; total cost scales with total vectors only.
  4. Isolation. Both strategies prevent cross-tenant queries from leaking data. The namespace strategy depends on the vendor enforcing the namespace boundary correctly — this is the security-critical assumption to verify.
  5. Migration cost. Moving from per-index to namespace later is painful (re-upsert everything). Choose namespace from day one for any SaaS workload.

Output.

Axis One index per tenant One index plus namespace
Indexes provisioned 5,000 1
Vendor index-limit risk high none
Per-tenant overhead fixed floor per index negligible
Cost at scale grows with tenant count grows with total vectors
Cross-tenant isolation hard wall logical wall

Rule of thumb. Default to one index with a namespace per tenant for any multi-tenant SaaS workload. The "one index per tenant" pattern is only ever right when tenants have wildly different SLOs and the team is willing to pay the per-index floor 5,000 times.

Vector database interview question on day-2 ops planning

A senior interviewer often closes the loop with: "Walk me through your runbook for the day the embedding-model vendor releases a new generation and a key customer asks if you will support it. What happens in week one, week two, week three?"

Solution Using a four-week blue / green upgrade plan with eval gates

# Week 1 — measure.
- Build an eval set: 500-1000 (query, gold-doc) pairs across all tenants.
- Baseline recall@10 and p99 latency on the current collection.
- Estimate full-re-embed cost: tokens · price-per-1K · 1.2 (safety margin).

# Week 2 — prepare.
- Create offline collection `docs_v2` with the new model's dimension and metric.
- Start the backfill: batch encode + upsert. Throttle to stay within rate limits.
- Begin dual-write at the producer layer (all new docs to both collections).

# Week 3 — evaluate + cutover.
- Run the eval set against `docs_v2` and compare to baseline.
- Gate: recall@10 must be within 0.5 points; p99 latency must be within 10%.
- If gate passes, flip the application pointer atomically.
- Keep `docs_v1` warm for a 5-day soak.

# Week 4 — clean up.
- Drop `docs_v1` after the soak.
- Update runbooks, alert thresholds, and dashboards to point at `docs_v2`.
- Post-mortem: cost actually paid, latency curve, recall delta.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Week Action Risk if skipped
1 freeze eval set + baseline no measurable rollback criterion
2 offline docs_v2 + dual-write new docs missing from v2 on cutover
3 eval gate + atomic switch silent recall regression in prod
4 soak + cleanup resource leak; orphan collection

The trace highlights that each step is an insurance payment against a specific failure mode. Skipping any step exchanges a small predictable cost for a large unpredictable one.

Output:

Metric Before (docs_v1) After (docs_v2) Delta
Recall@10 94.0% 95.2% +1.2 pts
p99 latency 18 ms 17 ms -1 ms
Re-embed cost $320 (one-time)
Downtime 0 seconds

Why this works — concept by concept:

  • Eval set first — without a frozen, representative (query, gold-doc) set, "recall went up" is a vibe, not a measurement. Build the eval set before the upgrade so the baseline is honest.
  • Dual-write during backfill — closes the race window where new docs arrive after the backfill starts but before the cutover. Without dual-write, those docs are missing from v2 on cutover day.
  • Atomic application switch — the application reads a single config knob (active_collection) per query. Changing it is one write to one config row; rollback is just as cheap.
  • Soak window — production traffic exposes failure modes that no eval set covers (long-tail queries, language mix, time-of-day patterns). The soak window catches them while docs_v1 is still recoverable.
  • Cost — re-embed is a one-time variable cost (tokens); the extra storage during cutover is a one-time fixed cost (a second collection's worth of RAM / SSD). Both are bounded and small relative to the cost of a recall regression.

SQL
Topic — database
Database ops problems (SQL)

Practice →


Cheat sheet — vector DB recipes

  • HNSW default parameters. M = 16, ef_construction = 200, ef_search = 64. Tune ef_search first (runtime); rebuild with higher M only if even ef_search = 256 falls short of recall.
  • pgvector HNSW index DDL. CREATE INDEX docs_hnsw ON docs USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 200); — choose the operator class (vector_cosine_ops, vector_l2_ops, vector_ip_ops) to match the query operator (<=>, <->, <#>).
  • Hybrid query — vector + metadata filter. Push the predicate into the ANN call: vector_db.search(q_vec, top_k=10, filter={"tenant_id": t, "lang": l}). Verify it is a pre-filter (not post-filter) with the vendor's explain mechanism.
  • Reranker layer. Pull top-50 with the ANN; rerank with a cross-encoder model on the (query, candidate) pairs; return top-10. Lifts precision by 5–15 points for ~50 ms extra latency.
  • Memory baseline. bytes ≈ 4 · dim · N for float32 vectors. At dim=1536 and N=1M, that is ~6 GB per million vectors. Scalar quantization cuts this 4×; product quantization cuts it 16–32×.
  • Drift checklist on a model upgrade. Build offline collection → backfill via re-embed → dual-write new docs → eval gate on frozen set → atomic application switch → soak → drop old collection.
  • Pre-filter vs post-filter. Always confirm your vendor pushes the filter into the index. Selective filters (< 10 percent of rows match) break post-filter strategies silently.
  • Multi-tenant pattern. Default to one index plus a namespace (or payload field) per tenant. Per-tenant indexes are an anti-pattern at SaaS scale.
  • Similarity metric contract. Choose one metric, write it down, enforce normalisation at the producer, and align the index operator class and query operator. Cosine is the safe default for text embeddings.
  • Build cost. HNSW build is O(N · ef_construction · log N); at 10M / ef_construction=200, expect 30–90 minutes on a single CPU. Plan for this every embedding-model upgrade.
  • Quantization combo recipe. PQ on the index for compression; rerank the top-50 with the original float32 vectors. Recovers most of the recall loss for negligible extra cost.
  • DiskANN trigger. Reach for DiskANN (or a vendor tier that uses it under the hood) when float32 + HNSW would exceed 5× your RAM budget. Below that, in-memory HNSW + quantization is cheaper and lower latency.
  • pgvector migration trigger. Move off pgvector when corpus crosses 15M vectors, sustained QPS crosses 200, or tenants cross ~100. Write the threshold down so the migration is a planned event, not a fire drill.

Frequently asked questions

Do I need a vector database or is pgvector enough?

If you already run Postgres, start with pgvector — adding a vector column and an HNSW index is CREATE EXTENSION vector plus one CREATE INDEX. It comfortably serves up to ~10 million vectors per replica at sub-25 ms p99 with M = 16 and ef_search = 64, and it lets you join vectors against your existing relational tables (orders, users, products) without any new infrastructure. The migration trigger to a dedicated vector database (Pinecone, Weaviate, Qdrant) is when your corpus crosses ~15 million vectors or sustained query throughput crosses ~200 QPS or you start running multi-tenant SaaS with 100+ namespaces. Until any of those three thresholds flips, pgvector is the cheapest option by a wide margin.

What's the difference between HNSW and IVFFlat?

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each vector has a fixed number of neighbour edges per layer; queries do a greedy graph walk and return the K nearest. IVFFlat (Inverted File with Flat quantizer) clusters vectors into nlist centroids via k-means; queries inspect the nprobe closest clusters and brute-force-search their members. HNSW has lower p99 latency at moderate scale and supports incremental inserts without a training step; IVFFlat uses less memory and is cheaper to write but has higher tail latency and needs a one-time training step on a representative sample. Default to HNSW unless RAM is the binding constraint — then consider IVFFlat, often combined with scalar or product quantization.

How do I size memory for a vector index?

The baseline formula is bytes ≈ 4 · dim · N for float32 vectors — at dim=1536 (OpenAI text-embedding-3-small) and 10 million vectors, that is ~60 GB just for the payload. HNSW adds a graph overhead of roughly 8 · M · N bytes (about 1–5 percent of the vector payload at typical M=16). Scalar quantization (int8) cuts the vector payload by 4× with a 0.5–3 point recall hit; product quantization (m=64 sub-vectors) cuts it by 16–32× with a 3–8 point recall hit (recoverable to within 0.5 points by reranking the top-50 with original float32 vectors). Plan for 30 percent RAM headroom above the index footprint for query buffers and OS cache.

Can a vector database replace my search engine?

Almost never — best-in-class retrieval is hybrid. Pure vector search underperforms BM25 keyword search by 10–30 points on exact-match queries ("error code 500", "iPhone 15 Pro Max") because dense embeddings smear semantically related but lexically different documents. Pure BM25 underperforms vector search on semantic queries ("why won't my user log in") because keyword matching misses paraphrase. The production pattern is to run both, then fuse the rankings either by reciprocal rank fusion or by a learned reranker. Weaviate exposes this natively via the with_hybrid(alpha=0.5) API; with Pinecone, Qdrant, or pgvector you wire the two retrievers together at the application layer.

What happens when I change embedding models?

Old vectors (encoded with model v1) and new query vectors (encoded with model v2) live in different semantic spaces — mixing them silently destroys recall. The mandatory pattern is a blue / green collection swap: create an offline collection docs_v2 with the new model's dimension, backfill it by re-embedding the entire corpus, dual-write every new document into both collections during the backfill, evaluate retrieval on a frozen (query, gold-doc) eval set, gate on "recall@10 within 0.5 points of baseline," then atomically flip the application pointer to docs_v2 and soak for 5 days before dropping docs_v1. Budget the variable cost (re-embed tokens) — at 10 million chunks and $0.02 per 1K tokens, expect $100–$500 per upgrade. Plan for at least one upgrade per quarter.

Pinecone vs Weaviate vs Qdrant — which should I pick?

Pinecone is fully managed serverless — minutes to first query, zero ops, predictable per-vector pricing; pick it when team size is small and there is no on-prem mandate. Weaviate is OSS plus managed cloud with built-in RAG modules (text2vec-openai, generative search) and first-class hybrid search via with_hybrid(alpha); pick it when you want batteries included and a GraphQL surface. Qdrant is OSS plus managed cloud in Rust with the strongest payload-filter language and on-disk index support; pick it for low-latency online search with heavy structured filtering, especially when on-prem is required. All three handle 1M–500M vectors comfortably; above that, Pinecone (sharded) or DiskANN-class OSS solutions are the only realistic options.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every vector database recipe above ships with hands-on practice rooms where you draw the topology, write the metadata-filter pushdown, and rehearse the blue / green collection swap against graded scenarios. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your sizing math for a 100-million-vector workload actually matches what a senior interviewer expects to hear.

Practice database design now →
Indexing drills →

Top comments (0)