rag pipeline looks like a model problem to a newcomer — interviewers know it is really a four-stage data pipeline problem dressed up in transformer vocabulary. The model is the cheap part; the expensive, error-prone, on-call-rotation part is the ingest, chunk, embed, index, and refresh loop that decides what context the model ever sees. When a retrieval-augmented generation system gives wrong answers in production, the fix almost never lives in the prompt — it lives in the pipeline.
This guide is the cheat sheet you wished existed the first time a stakeholder asked "why does the bot still cite the old policy?" It walks the end-to-end architecture, the four families of chunking strategies (fixed, recursive, semantic, hierarchical), embedding model selection and the metadata sidecar that travels with every chunk, hybrid dense + BM25 retrieval with cross-encoder reranking, and the freshness SLO + reindex playbook that keeps a rag data pipeline from drifting. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on ETL pipeline problems →, and stack the data-modelling muscles with dimensional modeling drills →.
On this page
- RAG as a data pipeline problem, not a prompting problem
- End-to-end RAG pipeline architecture
- Chunking strategies — fixed, semantic, recursive, hierarchical
- Embeddings + storage — choosing models and shaping the index
- Freshness, reindex, and ACLs
- Cheat sheet — RAG pipeline recipes
- Frequently asked questions
- Practice on PipeCode
1. RAG as a data pipeline problem, not a prompting problem
The model is the cheap part — the pipeline is where 80% of rag pipeline quality issues live
The one-sentence invariant: a RAG system is only as good as the chunks the retriever can return, and the chunks are only as good as the ingest, normalize, split, and embed pipeline that produced them. Once you internalise that "retrieval is the bottleneck," every late-night "why did it hallucinate that fact?" ticket resolves into one of four pipeline questions — was the source ingested? was it chunked at a sensible boundary? was the embedding model right for the content? was the index fresh?
The four stages every rag data pipeline shares.
- Ingest. Pull documents from the source — Confluence, S3, Postgres, Notion, support tickets, code repos — into a staging zone. Strip boilerplate, expand attachments, dedupe. The output is a clean text corpus tagged with source metadata.
- Chunk + embed. Split each document into retrieval-sized units (paragraphs, sections, sliding windows) and feed each chunk through an embedding model to produce a dense vector. Persist the chunk text, the vector, and a metadata sidecar.
- Index. Push every vector into a vector store (Pinecone, Weaviate, pgvector, Qdrant, Milvus) and keep the chunk text in a sidecar store (Postgres, S3) so the retrieval path can hand the model the text, not just the vector ID.
- Retrieve + rerank. At query time, embed the user question with the same model, do an approximate-nearest-neighbour search, fuse with a keyword score (BM25), rerank the top-N with a cross-encoder, and assemble the top-k chunks into the prompt.
Three places quality silently dies.
- At the chunk boundary. A fixed-size chunker that splits mid-sentence loses the conceptual unit. The retriever returns the right vicinity but the model gets half a sentence — and confidently completes it the wrong way.
-
At the metadata boundary. No
tenant_idon a chunk means the multi-tenant retrieval filter has nothing to filter on, and tenant A starts seeing tenant B's documents in the answer. This is the most expensive bug inretrieval augmented generationshipping today. - At the freshness boundary. A nightly batch reindex means "the new policy" is invisible to retrieval until 3am tomorrow. By then the stakeholder has already gone to Slack and screenshotted the wrong answer.
Where data engineers own the work.
- DE owns ingest, chunking, embedding orchestration, vector store schema, metadata sidecar, freshness SLO, ACL pushdown, eval harness ingestion. Everything between source-of-truth and the moment a chunk arrives in the prompt.
- ML / applied scientists own the embedding model choice, the reranker model choice, prompt template tuning, and the eval scoring rubric. Everything that decides "given a chunk, how is it scored / consumed."
- Boundary. The eval harness is shared: DE supplies the golden Q&A inputs and the evaluation runs against the deployed pipeline; ML defines the metrics (recall@k, MRR, faithfulness).
The 2026 reality.
- Vector stores are commoditising fast. pgvector inside Postgres handles low-tens-of-millions of vectors with reasonable latency; Pinecone, Qdrant, Weaviate, and Milvus take over above that. The choice is now a TCO decision, not a feature one.
- Hybrid search is the default. Pure dense retrieval lost ~5-15 points of recall on out-of-distribution keywords vs hybrid; teams now ship dense + BM25 score fusion (RRF or weighted) as table stakes.
-
Reranking is mandatory above ~10 users. Cross-encoders are 50-100x slower than ANN but lift top-k precision dramatically. The standard shape is
top-50 ANN → reranker → top-5 prompt. - Freshness is a contract. The mature pattern is CDC-from-source → embed worker → vector upsert, with a P95 source-to-retrieval lag SLO measured in minutes, not hours.
Worked example — the four ways RAG silently returns the wrong answer
Detailed explanation. Most "the bot lies" tickets fall into four pipeline buckets, and the fastest way to fix them is to know which bucket before you touch the prompt template. The four are: chunk boundary, missing metadata filter, stale index, embedding-model mismatch. Each has a code-shaped fix.
Question. A support bot keeps citing the wrong refund policy for tenant acme. The right policy is in Confluence, but the bot pulls the old one. List the four failure modes that could explain this and show the smallest pipeline-side test for each.
Input. A retrieved-context audit log row per query.
| query_id | tenant_id | retrieved_chunk_id | retrieved_chunk_source | retrieved_chunk_text | last_modified |
|---|---|---|---|---|---|
| q1 | acme | c123 | confluence/acme-refunds-v1 | "Refunds are 14 days..." | 2025-08-01 |
Code.
# Four pipeline-side tests — run each before touching the prompt
# 1) Chunk boundary test: does the right chunk even exist?
chunks = vector_store.search(query="acme refund policy", filter={"tenant_id": "acme"})
for c in chunks[:5]:
print(c.source, c.text[:120])
# expected: at least one chunk from confluence/acme-refunds-v2
# 2) Metadata filter test: is tenant_id present on the new doc's chunks?
new_chunks = vector_store.scan(filter={"source": "confluence/acme-refunds-v2"})
assert all(c.metadata.get("tenant_id") == "acme" for c in new_chunks)
# 3) Freshness test: how stale is the most-recent chunk for this source?
latest = max(c.metadata["last_modified"] for c in new_chunks)
print("source-to-retrieval lag:", now() - latest)
# 4) Embedding model test: same model on both write and read paths?
assert embed_model_id == vector_store.collection.embed_model_id
Step-by-step explanation.
- Test 1 confirms whether the new chunk is retrievable at all. If it is not in the top results for an obvious query, the chunking strategy or the embedding step failed. If it is in the top 50 but not top 5, you have a reranker / score fusion problem, not a retrieval problem.
- Test 2 confirms the metadata sidecar is intact. A common shape: the ingest job upserted text+vector but skipped the metadata payload. Without
tenant_id, the multi-tenant filter pulls nothing foracmeand the system falls back to a cross-tenant default — which still contains the old v1 doc. - Test 3 quantifies the freshness lag. If "now - last_modified" exceeds the SLO, the CDC stream is stalled or the embed worker is backed up. The fix is operational (drain the queue), not algorithmic.
- Test 4 catches the silent killer: the team upgraded the embedding model on the write path but the retrieval service still embeds queries with the old one. Vectors are in different spaces; cosine similarity is meaningless. The fix is a blue/green collection swap, not a hot upgrade.
Output.
| Failure mode | Test signal | Owner |
|---|---|---|
| Chunk missing / mis-split | search returns no v2 source | DE — chunker |
| Missing metadata | no tenant_id on v2 chunks | DE — ingest job |
| Stale index | source-to-retrieval lag > SLO | DE — CDC + worker |
| Embedding model mismatch | embed model IDs differ | ML + DE |
Rule of thumb. Before touching the prompt template, run the four pipeline tests. If all four pass and the answer is still wrong, then the problem is in the model or the prompt — and you can hand it to the ML team with high confidence the data layer is clean.
Worked example — why "just use a bigger context window" does not fix bad chunks
Detailed explanation. A common temptation when a RAG system gives wrong answers is to widen the context window and stuff more chunks into every prompt. This reliably makes the bill explode without lifting accuracy, because the recall problem (the right chunk was not retrieved at all) is not solved by giving the model more wrong chunks. The fix is upstream: better chunking, better embeddings, hybrid + rerank — not a bigger prompt.
Question. Your support bot retrieves the top-3 chunks and returns wrong answers ~12% of the time. A colleague proposes lifting top-k from 3 to 30 (10x bigger context). Quantify why this is unlikely to fix the bug.
Input. Retrieval log breakdown over 1000 wrong-answer queries.
| failure cause | share of wrong answers |
|---|---|
| right chunk not in top-50 at all | 62% |
| right chunk in top-50 but ranked outside top-3 | 25% |
| right chunk in top-3 but model still hallucinated | 13% |
Code.
# Failure mode breakdown — a tiny helper to bucket wrong answers
def diagnose(query, gold_chunk_id, top_n=50):
candidates = vector_store.search(query, k=top_n)
ranks = [i for i, c in enumerate(candidates) if c.id == gold_chunk_id]
if not ranks:
return "not_in_top_n" # recall miss — bigger k will not help
rank = ranks[0]
if rank >= 3:
return "rank_too_low" # rerank fix
return "model_failed" # prompt / model fix
buckets = collections.Counter(diagnose(q.text, q.gold_id) for q in failed_queries)
print(buckets)
# Counter({'not_in_top_n': 620, 'rank_too_low': 250, 'model_failed': 130})
Step-by-step explanation.
- The "lift top-k from 3 to 30" idea only helps the 25% of wrong answers where the right chunk was in top-50 but ranked below 3 — and only partially, because the model now also has 27 extra noisy chunks to confuse it.
- The dominant failure mode (62%) is recall miss — the right chunk is nowhere in the top-50. Increasing k from 3 to 30 changes none of those queries: the chunk is still missing. The fix is upstream — better chunking strategy, hybrid + BM25 score fusion, or re-embedding with a stronger model.
- The "model failed" bucket (13%) is the only one a prompt or model upgrade can fix, and it is the smallest bucket.
- Net: lifting top-k to 30 fixes at most 5-8% of the failures (the easier rerank misses), at 10x the LLM token cost. The right ROI is improving recall and adding a reranker.
Output.
| Action | Failures fixed | Cost change |
|---|---|---|
| Lift top-k from 3 to 30 | ~5-8% | 10x prompt tokens |
| Add hybrid + BM25 fusion | ~40-50% | 2x ingest, ~0 query |
| Add cross-encoder reranker | ~20% (on top of hybrid) | +50ms latency |
| Rechunk with semantic splitter | ~15% | one-time reindex |
Rule of thumb. Diagnose the failure bucket before spending money. "Lift top-k" is the single most-tempting and least-effective RAG fix — every dollar in extra prompt tokens is roughly five dollars not spent on the actual recall problem upstream.
Data engineering interview question on rag pipeline ownership
A senior interviewer often opens with: "Walk me through the four stages of a production RAG pipeline, who owns each stage, and the single most common failure mode in each. Where does a data engineer add the most value?" It blends pipeline architecture, retrieval, and freshness into one ownership map.
Solution Using the stage-ownership matrix
# A minimal pipeline-ownership map — runnable as a doc-test
STAGES = [
{
"stage": "ingest",
"owner": "DE",
"common_failure": "missing source connector or stale CDC stream",
"telemetry": "source_lag_seconds",
},
{
"stage": "chunk_embed",
"owner": "DE (chunker) + ML (model choice)",
"common_failure": "chunk size too small / boundary mid-sentence",
"telemetry": "chunks_per_doc, embed_qps",
},
{
"stage": "index",
"owner": "DE",
"common_failure": "missing metadata sidecar (tenant_id, ACL)",
"telemetry": "metadata_coverage_pct",
},
{
"stage": "retrieve_rerank",
"owner": "DE (infra) + ML (reranker)",
"common_failure": "embedding model mismatch between write and read",
"telemetry": "recall_at_5, mrr_at_10",
},
]
for s in STAGES:
print(f"{s['stage']:>16} owned by {s['owner']:<30} watch: {s['telemetry']}")
Step-by-step trace.
| Stage | Owner | Common failure | Telemetry |
|---|---|---|---|
| ingest | DE | stale CDC stream | source_lag_seconds |
| chunk_embed | DE + ML | chunk size too small | chunks_per_doc |
| index | DE | missing metadata | metadata_coverage_pct |
| retrieve_rerank | DE + ML | embed model mismatch | recall_at_5 |
The trace highlights that three of four stages are DE-owned outright, and the fourth is shared. The senior signal in this answer is naming the metadata sidecar coverage as the biggest preventable failure — most candidates focus on the embedding model and miss that the index schema is where the multi-tenant safety lives.
Output:
| Stage | DE value-add | Single biggest win |
|---|---|---|
| ingest | source connectors + CDC | sub-minute source lag |
| chunk_embed | strategy per content type | semantic + hierarchical |
| index | metadata schema + ACL | 100% tenant_id coverage |
| retrieve_rerank | hybrid + filter pushdown | dense + BM25 fusion |
Why this works — concept by concept:
- Stage ownership matrix — naming the owner per stage frames RAG as a data product, not a model deployment. Interviewers reward this framing because it predicts how the candidate will run on-call rotations.
- Telemetry per stage — the senior move is to attach an observable metric to each stage so the on-call rotation can localise failures in seconds. Source lag, metadata coverage, recall@5 are the canonical three.
-
Metadata sidecar is the safety story — every chunk carries
tenant_id,source,ACL,last_modified— these are not optional, they are the only thing keepingretrieval augmented generationfrom cross-tenant leakage and stale answers. -
Hybrid + rerank is the precision story — dense retrieval alone misses on rare keywords; BM25 alone misses on synonyms; fusion + rerank is the production default for
hybrid search. - Cost — ingest is O(docs) one-time + O(changes) per CDC tick; embed is O(chunks) at write time; retrieve is O(log N) ANN + O(top-N) rerank per query. Reranker dominates query latency; embed dominates ingest latency.
DE
Topic — ETL
ETL pipeline problems (DE)
2. End-to-end RAG pipeline architecture
The four-stage rag data pipeline — ingest, chunk + embed, index, retrieve + rerank — and the metadata sidecar that travels with every chunk
The mental model in one line: a production rag pipeline is two pipelines joined at a vector store — an offline batch+CDC pipeline that ingests, chunks, embeds, and upserts; and an online request pipeline that embeds the query, searches, reranks, and assembles the prompt. Once you draw those two flows on a whiteboard, every RAG architecture question collapses into "where does this responsibility sit?"
The offline ingest pipeline.
- Source connectors. One per source system. Confluence, Notion, Google Drive, S3, Postgres CDC, Slack export, GitHub repo, support ticket export. Each emits raw documents into a staging bucket with provenance metadata.
- Normalize + strip. Strip HTML / Markdown formatting, remove boilerplate (navigation, footers, code-of-conduct banners), expand attachments, OCR images if needed, dedupe near-identical pages.
- Split into logical units. Section-aware split: headings define sections, paragraphs define chunks within sections. Tables and code blocks are split as atomic units (do not chunk a table).
-
Chunker. Apply the per-content-type chunking strategy (covered in detail in section 3). Output is
(doc_id, chunk_idx, text, metadata). - Embedder. Batch-embed chunks through the embedding model. Cache by content hash so unchanged chunks are not re-embedded on every run.
-
Vector store upsert. Upsert
(vector, chunk_id, metadata)into the vector store. The chunk text goes to a sidecar store (Postgres, DynamoDB, S3) keyed bychunk_id.
The online retrieval pipeline.
- Query embed. The same embedding model encodes the user query into a vector. Critical: write and read paths must use the same model.
- ANN search with metadata filter. Approximate-nearest-neighbour search returns top-N candidates restricted by the metadata filter (tenant_id, source, ACL, recency window).
- Score fusion (hybrid). Dense ANN scores are fused with a sparse BM25 score from a parallel inverted index. Weighted sum or Reciprocal Rank Fusion (RRF) are the two common shapes.
- Reranker. A cross-encoder model takes the query + each of the top-N candidates and produces a precision-tuned score. Top-k after rerank are the chunks that go into the prompt.
-
Prompt assembly. Top-k chunks are concatenated with provenance headers (
"Source: confluence/page-123, modified 2026-06-10") so the LLM can cite sources.
The metadata sidecar — what every chunk carries.
-
tenant_id— for multi-tenant SaaS, every retrieval is filtered bytenant_id = ?. No exceptions. -
source— where the chunk came from (confluence/page-123,s3://bucket/key,postgres://schema.table#row). Powers attribution and trust signals. -
document_id— group chunks back to their parent document for "show me the full doc" follow-ups. -
last_modified— for freshness SLO measurement and time-window filters. -
acl_ids— list of permission tags that the retrieval filter intersects with the user's permissions. -
content_type—prose,code,table,transcript— used to pick the right reranker or trigger content-specific rules. -
embed_model_id— the embedding model version that produced this vector. Critical for blue/green model swaps.
Observability — what to graph.
-
Source lag. Per source, the P95 source-to-retrieval lag. The single most important SLO for
rag freshness. - Index hit rate. What fraction of queries return any chunks at all (low = empty store or over-filtered metadata).
- Fallback ratio. Fraction of queries that fall through to the "no context" prompt path. A sharp uptick is the first signal of a stalled embed worker.
- Recall@5 on golden set. Offline metric scored nightly against a curated Q&A set with known-correct chunk IDs.
- Reranker latency P95. Cross-encoders dominate query latency; alert on regressions.
The eval harness.
-
Golden Q&A set. 200-2000 curated
(question, expected_chunk_id, expected_answer)triples maintained by the product team plus SMEs. - Offline eval. Nightly run scores recall@k, MRR, faithfulness against the golden set. Drops trigger PagerDuty.
- Online eval (LLM-as-judge). Sample of live queries scored by a stronger LLM against rubric. Drift detection.
- Shadow traffic. New embedding model or chunker runs in shadow mode against live queries before the swap — compare top-5 overlap and recall before promoting.
Worked example — building the offline ingest stage in 50 lines
Detailed explanation. The minimal happy path: pull a document, normalize, chunk with overlap, embed in a batch, upsert with metadata. Everything in production is a hardening of these five steps — retries, dedup, content-hash caching, sidecar persistence.
Question. Sketch the offline ingest stage that takes a Confluence page, splits it into 500-token chunks with 75-token overlap, embeds each chunk, and upserts into a vector store with a tenant-aware metadata sidecar.
Input. A single Confluence page object.
| field | value |
|---|---|
| page_id | "PAGE-42" |
| tenant_id | "acme" |
| body_markdown | "## Refund policy\n\nAt Acme..." (~3500 tokens) |
| last_modified | "2026-06-10T09:14:00Z" |
Code.
import hashlib
from typing import Iterable
CHUNK_TOKENS = 500
OVERLAP_TOKENS = 75
def chunk_text(text: str, size: int, overlap: int) -> Iterable[str]:
"""Sliding-window chunker — tokens approximated as whitespace splits."""
tokens = text.split()
step = size - overlap
for i in range(0, len(tokens), step):
yield " ".join(tokens[i : i + size])
if i + size >= len(tokens):
break
def ingest_page(page: dict) -> None:
chunks = list(chunk_text(page["body_markdown"], CHUNK_TOKENS, OVERLAP_TOKENS))
# Batch-embed all chunks in one call — 10-50x cheaper than per-chunk
vectors = embed_model.encode(chunks)
rows = []
for idx, (text, vec) in enumerate(zip(chunks, vectors)):
chunk_id = f"{page['page_id']}#chunk-{idx}"
rows.append({
"id": chunk_id,
"vector": vec,
"metadata": {
"tenant_id": page["tenant_id"],
"source": f"confluence/{page['page_id']}",
"document_id": page["page_id"],
"last_modified": page["last_modified"],
"chunk_idx": idx,
"content_hash": hashlib.sha256(text.encode()).hexdigest(),
"embed_model_id": embed_model.id,
},
})
text_store.put(chunk_id, text) # sidecar — chunk text in Postgres / S3
vector_store.upsert(rows) # vectors + metadata
Step-by-step explanation.
- The chunker is a sliding-window splitter — 500 tokens forward, 75 tokens back, so each chunk shares ~15% with its neighbours. Overlap is what saves you when the answer straddles a chunk boundary.
-
embed_model.encode(chunks)is called once with the full batch. Per-chunk calls are 10-50x more expensive due to per-request overhead and HTTP round-trips. - Every chunk gets a deterministic
chunk_idderived from the document id and chunk index, so re-ingesting the same page produces the same IDs —upsertoverwrites instead of duplicating. - The metadata sidecar carries every filter the retrieval path will ever need:
tenant_id(multi-tenancy),sourceanddocument_id(attribution),last_modified(freshness),chunk_idx(sibling lookup),content_hash(dedup / skip-unchanged), andembed_model_id(blue/green safety). - The chunk text goes into a sidecar text store keyed by
chunk_id. The vector store only stores the vector — sidecar lookup happens at the prompt-assembly stage. - Idempotency: re-ingesting the same page is a no-op for unchanged chunks (same hash, same vector, same metadata) and a one-row update for changed chunks.
Output.
| chunk_id | tenant_id | text snippet | vector dims |
|---|---|---|---|
| PAGE-42#chunk-0 | acme | "## Refund policy At Acme..." | 1536 |
| PAGE-42#chunk-1 | acme | "...within 14 days of purchase..." | 1536 |
| PAGE-42#chunk-2 | acme | "...exceptions for digital goods..." | 1536 |
Rule of thumb. Every chunk needs three things from day one: a deterministic ID, a content hash, and a metadata sidecar that includes tenant_id, source, and last_modified. Bolt them on later and you are rewriting the ingest job, not patching it.
Worked example — the online retrieval stage end-to-end
Detailed explanation. Once the ingest stage has populated the index, the online path is a tight five-step pipeline: embed query, ANN search with metadata filter, BM25 fusion, rerank, assemble prompt. The whole loop should run in <300ms P95 to feel responsive.
Question. Sketch the online retrieval stage that takes a user query, applies the tenant filter, fuses dense + BM25 scores, reranks the top-50 with a cross-encoder, and returns the top-5 chunks plus assembled prompt.
Input. A single user query plus the calling user's tenant and ACL list.
| field | value |
|---|---|
| query | "What is Acme's refund window for digital goods?" |
| tenant_id | "acme" |
| acl_ids | ["public", "support"] |
Code.
def retrieve_and_rerank(query: str, tenant_id: str, acl_ids: list[str]) -> dict:
# 1) Embed the query with the SAME model used at ingest
qvec = embed_model.encode([query])[0]
# 2) ANN search with metadata filter — tenant + ACL pushdown
candidates = vector_store.search(
vector=qvec,
top_k=50,
filter={
"tenant_id": tenant_id,
"acl_ids": {"$in": acl_ids},
},
)
# 3) Hybrid score fusion — Reciprocal Rank Fusion (RRF) with k=60
bm25_hits = bm25_index.search(query, top_k=50, filter={"tenant_id": tenant_id})
fused = rrf_fuse(candidates, bm25_hits, k=60)
# 4) Cross-encoder rerank on the top 50 fused candidates
pairs = [(query, text_store.get(c.id)) for c in fused[:50]]
rerank_scores = reranker.score(pairs)
top5 = sorted(zip(fused[:50], rerank_scores), key=lambda x: -x[1])[:5]
# 5) Assemble the prompt with provenance headers
blocks = []
for cand, _ in top5:
m = cand.metadata
text = text_store.get(cand.id)
blocks.append(
f"[source: {m['source']} · modified {m['last_modified']}]\n{text}"
)
return {
"prompt_context": "\n\n---\n\n".join(blocks),
"citations": [c.metadata["source"] for c, _ in top5],
}
def rrf_fuse(dense, sparse, k=60):
"""Reciprocal Rank Fusion — combine two ranked lists by 1/(k+rank)."""
scores: dict[str, float] = {}
for rank, hit in enumerate(dense):
scores[hit.id] = scores.get(hit.id, 0) + 1 / (k + rank + 1)
for rank, hit in enumerate(sparse):
scores[hit.id] = scores.get(hit.id, 0) + 1 / (k + rank + 1)
return sorted(
{c.id: c for c in dense + sparse}.values(),
key=lambda c: -scores[c.id],
)
Step-by-step explanation.
- The query is embedded with the same model that produced the index vectors. If the model IDs disagree, vectors are in different spaces and cosine similarity is noise — this is the silent killer covered in section 1.
- The ANN search pushes the tenant filter down into the index — it is not a post-filter. Pinecone, Qdrant, Weaviate, and pgvector all support metadata predicate pushdown, and using it is the difference between a 10ms query and a 500ms query at scale.
- The BM25 search runs in parallel against a sparse inverted index (Elasticsearch, OpenSearch, Lucene) with the same tenant filter. Dense + sparse together is the
hybrid searchshape. - RRF fuses the two ranked lists by summing
1/(k+rank)— no need to normalize scores, no per-system weight tuning.k=60is the standard starting value from the IR literature. - The cross-encoder reranker takes
(query, chunk_text)pairs and produces a precision-tuned score. It is 50-100x slower per pair than ANN, so it only runs on the top 50 — never the full corpus. - Top-5 reranked chunks become the prompt context. Each block carries a
sourceandlast_modifiedheader so the LLM can quote the source and the consumer can audit freshness.
Output.
| step | result |
|---|---|
| ANN top-50 (dense) | 50 candidates |
| BM25 top-50 (sparse) | 50 candidates |
| RRF fused | ~70 unique candidates |
| Reranked top-5 | 5 chunks for prompt |
| Prompt context | ~2500 tokens with citations |
Rule of thumb. The online stage is a fixed five-step pipeline: embed → ANN+filter → BM25+fusion → rerank → assemble. Every step has a 100ms budget; every step has its own telemetry. If P95 latency drifts, the slow step is almost always the reranker — cap the rerank candidate set and tune k.
Worked example — the eval harness golden-set scoring loop
Detailed explanation. A golden Q&A set is the only reliable way to know whether a pipeline change improved or hurt retrieval. The shape: a CSV / Parquet of (question, expected_chunk_id, expected_answer) triples maintained by SMEs. The harness runs every nightly job and scores recall@k and MRR. A sustained drop pages the on-call.
Question. Sketch a nightly offline eval that scores the current pipeline against a golden set and reports recall@5 and mean reciprocal rank (MRR) at 10.
Input. A 500-row golden set.
| question_id | question | expected_chunk_id |
|---|---|---|
| g001 | "Refund window for digital goods?" | PAGE-42#chunk-1 |
| g002 | "How long is the trial period?" | PAGE-09#chunk-3 |
Code.
def score_pipeline(golden: list[dict], k: int = 5) -> dict:
recall_hits = 0
rr_sum = 0.0
for q in golden:
hits = retrieve_and_rerank(
query=q["question"],
tenant_id=q.get("tenant_id", "default"),
acl_ids=q.get("acl_ids", ["public"]),
)
ids = [h.id for h in hits["chunks"][:10]]
if q["expected_chunk_id"] in ids[:k]:
recall_hits += 1
try:
rank = ids.index(q["expected_chunk_id"]) + 1
rr_sum += 1 / rank
except ValueError:
pass # not in top-10 → reciprocal rank contributes 0
return {
"recall_at_k": recall_hits / len(golden),
"mrr_at_10": rr_sum / len(golden),
"n": len(golden),
}
# Nightly job
result = score_pipeline(load_golden_set(), k=5)
publish_metric("rag.recall_at_5", result["recall_at_k"])
publish_metric("rag.mrr_at_10", result["mrr_at_10"])
if result["recall_at_k"] < 0.85:
page_oncall("recall@5 regression", result)
Step-by-step explanation.
- For each golden question, the harness runs the real online retrieval pipeline — same embed, same filter, same rerank — against the live index. No mocking.
-
recall_at_kis "did the gold chunk appear in the top-k?" averaged across the golden set. The standard production threshold is 0.85-0.95 depending on stakes. - MRR at 10 is "the average of 1/rank over the golden set, with rank=∞ contributing 0." MRR rewards getting the right chunk high in the ranking — it is a precision-tilted recall metric.
- Both metrics are published to the metric store and trended over time. A sustained drop ≥5 points triggers PagerDuty and a rollback investigation.
- The harness is the single source of truth on whether a chunking change, an embedding model upgrade, or a reranker swap helped. Without it, every change is a vibe-based decision.
Output.
| metric | value | threshold |
|---|---|---|
| recall@5 | 0.91 | ≥ 0.85 |
| MRR@10 | 0.74 | ≥ 0.65 |
| n | 500 | — |
Rule of thumb. No golden set, no production RAG. Build a 200-row golden set on day one; grow it to 1000-2000 as edge cases emerge. The harness is what turns RAG from "vibes shipping" into a real data product with an SLO.
Data engineering interview question on online vs offline pipelines
A senior interviewer might frame this as: "Draw the offline ingest and online retrieval pipelines on a whiteboard, then mark where each one can fail at 3am and how you would alert on it." It probes both system design and on-call instincts.
Solution Using a two-pipeline diagram + failure-mode catalogue
# Two pipelines joined at the vector store
OFFLINE = [
"source_connector",
"normalize",
"chunk",
"embed",
"vector_upsert",
]
ONLINE = [
"query_embed",
"ann_search_with_filter",
"bm25_search",
"rrf_fuse",
"rerank",
"assemble_prompt",
]
FAILURES = {
"source_connector": "auth expired / source down → source_lag_seconds spike",
"chunk": "boundary mid-sentence → recall@5 drop on golden set",
"embed": "model API down → embed_queue_depth grows unbounded",
"vector_upsert": "schema mismatch → upsert_error_rate spike",
"ann_search_with_filter": "metadata not indexed → P95 latency spike",
"rerank": "model timeout → fallback_ratio spike",
}
Step-by-step trace.
| Pipeline | Stage | Failure signal | Alert |
|---|---|---|---|
| offline | source_connector | source_lag_seconds > 600 | page |
| offline | embed | embed_queue_depth growing | page |
| offline | vector_upsert | upsert_error_rate > 1% | page |
| online | ann_search | P95 latency > 200ms | page |
| online | rerank | fallback_ratio > 5% | page |
| online | assemble_prompt | empty_context_ratio > 2% | warn |
The trace highlights that offline and online have orthogonal failure modes — an offline stall produces a staleness failure; an online stall produces a latency or quality failure. Each needs its own SLO.
Output:
| Pipeline | Top SLO | Alert threshold |
|---|---|---|
| offline ingest | P95 source-to-retrieval lag | < 5 min |
| online retrieve | P95 end-to-end latency | < 300 ms |
| online quality | recall@5 (nightly) | ≥ 0.85 |
Why this works — concept by concept:
- Two pipelines, one vector store — the offline path is throughput-bound (catch up on backlog); the online path is latency-bound (respond in <300ms). Splitting them lets you scale the embed worker independently from the retrieval service.
-
SLO per pipeline — the offline SLO is a freshness number (
source_lag_seconds); the online SLO is a latency number (P95_response_time); the quality SLO is an offline metric (recall@5). All three matter. -
Filter pushdown — the metadata filter (
tenant_id,acl_ids) is pushed into the ANN search, not applied after. This is the single biggest performance lever in the online path. - Reranker is the precision lever — without it, you get hybrid-search recall but consumer-noisy ranking. With it, top-5 precision climbs sharply at the cost of ~50-100ms.
- Cost — offline: O(docs) one-time + O(changes) per CDC tick + O(chunks) embedding cost. Online: one embed call + one ANN query + one BM25 query + one rerank batch per request. Reranker dominates online cost; embedding dominates offline cost.
DE
Topic — streaming
Streaming pipeline problems (DE)
3. Chunking strategies — fixed, semantic, recursive, hierarchical
chunking strategies decide what the retriever can ever return — pick the strategy per content type, not per project
The mental model in one line: a chunk is the smallest unit your retriever can return, so the chunk shape is the upper bound on retrieval precision — split mid-sentence and the model gets half an idea; split mid-paragraph and you lose the connective tissue between claims. Once you say "the chunk is the retrieval unit," you stop optimising token counts and start optimising meaningful boundaries.
The four families.
- Fixed token windows. Pick a target size (e.g. 500 tokens) and a stride. Cheap, deterministic, dumb at boundaries. Default starting point — beat it before adding complexity.
-
Recursive character / token splitter. Try to split by paragraph; if too big, fall back to sentence; if still too big, fall back to a fixed token window. LangChain's
RecursiveCharacterTextSplitterpopularised this shape. - Semantic chunking. Embed each sentence and split where the cosine similarity between adjacent sentences drops below a threshold — i.e. where the topic visibly shifts. Higher quality, ~2-3x more expensive at ingest.
- Hierarchical / parent-child. Index small child chunks (for high-recall retrieval) but at prompt time return the parent chunk (a paragraph or section) that contains the matched child. The model gets context; the retriever gets precision.
Overlap windows.
- What overlap is. Each chunk shares its first N tokens with the previous chunk's last N tokens. 10-20% is the standard range (~50-100 tokens for a 500-token chunk).
- Why it matters. A claim that straddles a chunk boundary appears in both chunks instead of being orphaned. Overlap is the cheapest insurance against boundary loss.
- When to skip it. Atomic content (a code block, a table, a paragraph) does not need overlap — it is already its own unit.
Per-content-type strategy table.
| Content type | Recommended strategy | Notes |
|---|---|---|
| Prose / docs | recursive or semantic, 400-600 tokens, 15% overlap | semantic if budget allows |
| Code | one chunk per function / class (AST split) | never split mid-function |
| Tables / structured | one chunk per table or per row group | preserve column headers |
| Transcripts / chat | one chunk per turn or per N seconds | speaker label as metadata |
| FAQs | one chunk per Q+A pair | the Q is the retrieval signal |
| Long PDFs (manuals) | hierarchical: child = paragraph, parent = section | retrieve child, serve parent |
Common chunking interview probes.
- "What is the right chunk size?" — there is no universal right; start at 500 tokens with 75 overlap, then tune against the golden set. Senior signal: name "embedding model context window" as the hard upper bound (most embedding models are 512 tokens) and "downstream LLM context budget" as the soft constraint.
- "When do you use semantic chunking?" — when prose has shifting topics within a single document (long-form blogs, transcripts, multi-topic reports). Skip it for tightly-scoped reference docs where fixed windows match natural paragraph length.
- "What is parent-child chunking?" — index small chunks for high-recall ANN, but at prompt time return the parent chunk (a paragraph, section, or sliding window) so the LLM gets enough context. Standard shape on long-form RAG.
- "How do you chunk code?" — never mid-function. Use the language's AST to split at function and class boundaries; index the function signature + docstring + body as one chunk.
Worked example — fixed-window vs recursive splitter on the same document
Detailed explanation. A fixed-window splitter is the simplest possible chunker — slide a window of N tokens with stride S. It is fast, but it cheerfully cuts sentences and paragraphs in half. A recursive splitter tries semantic boundaries first (paragraphs, then sentences) before falling back to a fixed window, so most splits land at natural boundaries.
Question. Implement a fixed-window splitter and a recursive splitter for a 1200-token document. Compare the splits and show why the recursive version preserves more semantic units.
Input. A document with three paragraphs (400 + 350 + 450 tokens, separated by \n\n).
| paragraph | tokens |
|---|---|
| P1 — refund policy | 400 |
| P2 — exceptions | 350 |
| P3 — appeals | 450 |
Code.
def fixed_window(text: str, size: int = 500, overlap: int = 75) -> list[str]:
tokens = text.split()
out, step = [], size - overlap
for i in range(0, len(tokens), step):
chunk = tokens[i : i + size]
if not chunk:
break
out.append(" ".join(chunk))
if i + size >= len(tokens):
break
return out
def recursive_split(text: str, size: int = 500, overlap: int = 75) -> list[str]:
"""Try paragraph → sentence → fixed-window fallback."""
# 1) Try paragraph split first
paras = text.split("\n\n")
out: list[str] = []
for p in paras:
ptokens = p.split()
if len(ptokens) <= size:
out.append(p)
continue
# 2) Paragraph too big → fall back to sentence split
sents = re.split(r"(?<=[.!?])\s+", p)
buf: list[str] = []
buf_tok = 0
for s in sents:
st = s.split()
if buf_tok + len(st) > size and buf:
out.append(" ".join(buf))
buf, buf_tok = [], 0
buf.append(s)
buf_tok += len(st)
if buf:
out.append(" ".join(buf))
# 3) Sentence-level chunks still too big? fall back to fixed window
out = [c if len(c.split()) <= size else fixed_window(c, size, overlap)
for c in out]
# Flatten any nested lists from step 3
flat: list[str] = []
for c in out:
flat.extend(c) if isinstance(c, list) else flat.append(c)
return flat
Step-by-step explanation.
- Fixed-window on the 1200-token document with size=500, overlap=75 produces three chunks: tokens 0-500, 425-925, 850-1200. The first chunk ends mid-paragraph 2 (token 500 is inside P2). The second chunk starts mid-paragraph 2. Boundary loss.
- Recursive split tries paragraph boundaries first. P1 (400 tokens) is ≤500, so it becomes one chunk. P2 (350 tokens) is ≤500, so it becomes one chunk. P3 (450 tokens) is ≤500, so it becomes one chunk. Three clean paragraph-aligned chunks, zero mid-sentence splits.
- If a paragraph exceeds 500 tokens, the recursive splitter falls back to sentence split — still semantic, just one level finer. Only paragraphs and sentences that exceed the size fall back to the dumb fixed-window splitter.
- The recursive version produces more chunks on average (one per paragraph instead of one per 500-token window), but every chunk respects a natural boundary. Retrieval precision improves at the cost of slightly more chunks to index.
Output.
| Strategy | # chunks | mid-sentence splits |
|---|---|---|
| fixed_window(500, 75) | 3 | 2 (inside P2) |
| recursive_split(500, 75) | 3 | 0 |
Rule of thumb. Default to recursive splitting for prose — the cost over a fixed window is negligible (one extra pass over the text) and the recall lift is consistent. Reserve plain fixed-window for tightly-controlled content types where the natural unit is already the window size (timeseries, log lines, code lines).
Worked example — semantic chunking by similarity drop
Detailed explanation. Semantic chunking embeds each sentence and walks the document looking for drops in cosine similarity between adjacent sentences — those drops mark topic shifts. A chunk is the run of sentences between two drops. The cost is one extra embedding call per sentence at ingest, but the chunks line up with topical boundaries that no syntactic splitter can find.
Question. Implement a semantic chunker that splits a document at sentence boundaries where the cosine similarity between consecutive sentences drops below a threshold.
Input. A 6-sentence document that shifts topic twice.
| sentence_idx | text |
|---|---|
| 0 | "Refund policy: we accept returns within 14 days." |
| 1 | "Returns must be in original packaging." |
| 2 | "Shipping is free on orders over $50." |
| 3 | "We use FedEx and UPS for delivery." |
| 4 | "Our support hours are 9 to 5 EST." |
| 5 | "Reach us via email or chat." |
Code.
def semantic_chunk(text: str, threshold: float = 0.55) -> list[str]:
sents = re.split(r"(?<=[.!?])\s+", text)
if len(sents) < 2:
return sents
vecs = embed_model.encode(sents)
boundaries = [0] # split *before* these indices
for i in range(1, len(vecs)):
sim = cosine(vecs[i - 1], vecs[i])
if sim < threshold:
boundaries.append(i)
boundaries.append(len(sents))
chunks: list[str] = []
for a, b in zip(boundaries[:-1], boundaries[1:]):
chunks.append(" ".join(sents[a:b]))
return chunks
def cosine(a, b):
import math
dot = sum(x * y for x, y in zip(a, b))
na = math.sqrt(sum(x * x for x in a))
nb = math.sqrt(sum(y * y for y in b))
return dot / (na * nb + 1e-9)
Step-by-step explanation.
- Each sentence is embedded once at ingest time — this is the cost driver. For a 100-sentence document, semantic chunking issues 100 embed calls instead of zero (fixed-window).
- Adjacent sentences are compared pairwise. The cosine similarity between sentence 1 ("packaging") and sentence 2 ("shipping") is below 0.55 → boundary inserted before sentence 2.
- The walk continues: sentence 3 vs 4 similar (both about shipping), no split. Sentence 4 vs 5 below 0.55 (shipping → support hours), boundary inserted.
- Three chunks emerge:
[0,1] refund/returns,[2,3] shipping,[4,5] support. Each chunk is a topical unit, even though they all sit in one source document. - Threshold is tuned on the golden set. Too high → too many chunks (every sentence is its own chunk); too low → no splits at all. 0.45-0.65 is the typical range with OpenAI / Cohere embeddings.
Output.
| chunk_idx | sentences | topic |
|---|---|---|
| 0 | 0, 1 | refund / returns |
| 1 | 2, 3 | shipping |
| 2 | 4, 5 | support |
Rule of thumb. Use semantic chunking when documents mix topics (long-form blogs, all-hands transcripts, multi-section policy docs). Skip it when each source document is already a single tight topic — the marginal recall gain does not pay for the ingest cost.
Worked example — hierarchical (parent-child) chunking
Detailed explanation. Index small chunks (sentences or short paragraphs) for high-recall ANN matching, but at prompt assembly time return the parent chunk (a section or full paragraph) that contains the matched child. The retriever gets to match on tight semantic units; the LLM gets enough surrounding context to answer.
Question. Implement a parent-child chunker that indexes sentence-level child chunks but returns the paragraph-level parent at retrieval time.
Input. A 3-paragraph document where each paragraph has 2-3 sentences.
| paragraph_idx | sentence_idx | text |
|---|---|---|
| 0 | 0 | "Refunds are accepted within 14 days." |
| 0 | 1 | "Returns must be in original packaging." |
| 1 | 0 | "Digital goods are non-refundable." |
| 1 | 1 | "Subscriptions cancel at period end." |
| 2 | 0 | "Contact support at help@acme.com." |
Code.
def parent_child_chunk(doc_id: str, text: str) -> tuple[list[dict], dict]:
parents: dict[str, str] = {}
children: list[dict] = []
paras = text.split("\n\n")
for p_idx, para in enumerate(paras):
parent_id = f"{doc_id}#para-{p_idx}"
parents[parent_id] = para
sents = re.split(r"(?<=[.!?])\s+", para)
for s_idx, sent in enumerate(sents):
children.append({
"id": f"{doc_id}#sent-{p_idx}-{s_idx}",
"text": sent,
"parent_id": parent_id,
})
return children, parents
def retrieve_with_parent(query: str, top_k: int = 5) -> list[str]:
# 1) ANN search over the child (sentence) index
child_hits = vector_store.search(query=query, top_k=top_k * 3)
# 2) Resolve to parent chunks, dedupe — same parent counted once
seen, out = set(), []
for c in child_hits:
pid = c.metadata["parent_id"]
if pid in seen:
continue
seen.add(pid)
out.append(parent_store.get(pid))
if len(out) == top_k:
break
return out
Step-by-step explanation.
- At ingest, every sentence becomes a child chunk with a vector, and every paragraph becomes a parent record (text only — no vector needed because parents are never directly searched).
- Each child carries a
parent_idin its metadata sidecar so the retrieval path can resolve from match back to context. - At query time, the ANN search runs over the child index — short, semantically tight units that match queries crisply.
- The result list of child matches is collapsed by
parent_id(dedupe) and the parent paragraph text is returned. If sentences 0 and 1 of paragraph 0 both match, the paragraph is returned once. -
top_k * 3over-retrieves on the child side because multiple children from the same parent may match — over-fetch lets you still emittop_kunique parents.
Output (for query "How long for refunds?").
| match | child_id | parent returned |
|---|---|---|
| 1st hit | doc#sent-0-0 | "Refunds are accepted within 14 days. Returns must be in original packaging." |
Rule of thumb. Use parent-child when chunks need to be small enough for tight retrieval (sentence-level) but the LLM needs surrounding context to answer (paragraph or section level). It is the default shape for technical docs, legal docs, and long-form policy.
Worked example — overlap window math and why 15% is the default
Detailed explanation. Overlap is the cheapest insurance against a fact landing across a chunk boundary. The math is simple: with overlap O on a chunk of size S, every claim within the first O tokens of a chunk also appears in the previous chunk; every claim within the last O tokens also appears in the next chunk. A claim has two shots at being returned.
Question. Given chunk size 500 tokens and overlap 75 tokens, what fraction of the document appears in two chunks? When is overlap worth it and when is it waste?
Input. A document of 5000 tokens.
Code.
def overlap_stats(doc_tokens: int, size: int, overlap: int) -> dict:
step = size - overlap
n_chunks = (doc_tokens - overlap) // step + (1 if (doc_tokens - overlap) % step else 0)
total_tokens_stored = n_chunks * size
redundant = total_tokens_stored - doc_tokens
return {
"n_chunks": n_chunks,
"total_tokens_stored": total_tokens_stored,
"redundant_tokens": redundant,
"overlap_pct": overlap / size,
"storage_overhead_pct": redundant / doc_tokens,
}
for overlap in (0, 50, 75, 100, 200):
print(overlap, overlap_stats(5000, 500, overlap))
Step-by-step explanation.
- Step = size - overlap = 500 - 75 = 425. To cover 5000 tokens, you need
ceil((5000 - 75) / 425) ≈ 12chunks (vs 10 chunks with zero overlap). - Total tokens stored = 12 × 500 = 6000. Doc has 5000 unique tokens. Redundant tokens = 1000, which is exactly 2 × (chunks - 1) × overlap = 2 × 11 × ... approximated by
(n_chunks - 1) * overlapoverhead. - Storage overhead at 15% overlap is ~20% extra tokens stored (and embedded, and indexed). That is the cost.
- The benefit: every fact within 75 tokens of a chunk boundary now appears in two chunks, doubling its chance of being retrieved.
- 15% (~75 tokens on a 500-token chunk) is the empirical sweet spot — below 10% leaves obvious boundary gaps; above 25% pays cost without much extra recall.
Output.
| overlap | n_chunks | redundant tokens | storage overhead |
|---|---|---|---|
| 0 | 10 | 0 | 0% |
| 50 | 11 | 500 | 10% |
| 75 | 12 | 1000 | 20% |
| 100 | 13 | 1500 | 30% |
| 200 | 17 | 3500 | 70% |
Rule of thumb. Start at 15% overlap (75 tokens on a 500-token chunk) — it is the empirically calibrated default. Drop to 0 only for atomic content (a row, a function, a Q+A pair) where there is no boundary to worry about. Past 25% you pay storage and ingest cost without commensurate recall gain.
Data engineering interview question on picking a chunking strategy
A senior interviewer might frame this as: "You inherit a RAG system with 100k mixed documents — long-form policy PDFs, support tickets, code samples, transcripts. The current chunker is a fixed 1000-token window with no overlap. Recall@5 is 0.62 on the golden set. Where do you start?"
Solution Using a per-content-type chunking strategy
def chunk_dispatch(doc: dict) -> list[dict]:
"""Pick the right chunker per content type — strategy is not one-size-fits-all."""
content_type = doc["content_type"]
if content_type == "policy_pdf":
# Long-form, multi-topic → hierarchical (sentence child, paragraph parent)
return parent_child_chunker(doc, child_size=120, parent_size=600, overlap=15)
if content_type == "support_ticket":
# Short, conversational → one chunk per ticket, no overlap
return [{"text": doc["body"], "metadata": doc["metadata"]}]
if content_type == "code":
# AST-aware split at function / class boundaries
return ast_chunker(doc, language=doc["language"])
if content_type == "transcript":
# Topic-shift aware → semantic chunking
return semantic_chunker(doc, threshold=0.55)
# Default for plain prose
return recursive_chunker(doc, size=500, overlap=75)
Step-by-step trace.
| content_type | strategy | starting params | expected recall@5 lift |
|---|---|---|---|
| policy_pdf | hierarchical | child 120 / parent 600 | +0.15-0.20 |
| support_ticket | atomic | one chunk per ticket | +0.05-0.10 |
| code | AST split | per function / class | +0.10-0.15 |
| transcript | semantic | threshold 0.55 | +0.10-0.15 |
| plain prose | recursive | 500 / 75 overlap | +0.05 baseline |
The trace highlights that a single strategy across all content types is the bug, not a missing feature. Switching from fixed-1000 to per-type strategies typically lifts recall@5 by 0.15-0.25 in aggregate — the single largest one-shot win in any RAG project.
Output:
| Metric | Before (fixed 1000) | After (per-type) |
|---|---|---|
| recall@5 | 0.62 | ~0.83 |
| MRR@10 | 0.41 | ~0.63 |
| avg chunks per doc | 5.2 | 11.7 |
Why this works — concept by concept:
- Per-content-type dispatch — the chunk shape that maximises recall depends entirely on the content shape. Code wants AST boundaries; policy wants section boundaries; tickets are already atomic; transcripts shift topic. One chunker for all four is a contradiction.
- Hierarchical for long-form — small children give tight retrieval; large parents give the LLM enough context. The standard winning shape for technical docs.
- Semantic for shifting topics — costs more at ingest but pays back on multi-topic documents (transcripts, blogs). Skip on tight-scope docs.
- AST for code — splitting mid-function loses the function signature or the body. AST-aware chunking pairs them — interviewers love this answer because it shows content-aware thinking.
- Cost — fixed-window is O(doc_tokens); recursive adds ~1 pass; semantic adds one embed call per sentence; hierarchical doubles the storage (children + parents). The recall lift typically pays back within 1 week of token cost.
DE
Topic — data transformation
Data transformation problems (DE)
4. Embeddings + storage — choosing models and shaping the index
embeddings decide what "similar" means in your index — and the metadata sidecar decides whether the right user is allowed to see it
The mental model in one line: an embedding model is a learned similarity function — it determines which chunks the retriever calls "close" — and a vector stores schema is the index plus the metadata sidecar that keeps that similarity safe to ship. Once you say "the embedding decides recall, the metadata decides safety," every RAG storage question fits into one of those two columns.
Embedding model selection — the four axes.
-
Quality (recall@k on MTEB). OpenAI
text-embedding-3-large, Cohereembed-v3, and OSS models like BGE-large and e5-mistral lead the public benchmarks. The gap between top-tier hosted and top OSS narrowed sharply in 2025-2026. - Dimensions. 256-dim models are 6x cheaper to store and search than 1536-dim models, with single-digit-point recall trade-off. Most production teams ship 384-768 dim now.
- Cost. Per-million-token embedding cost. Hosted models charge ~$0.02-0.13 per million tokens; self-hosted OSS is GPU-amortised.
-
Latency. Per-batch latency at ingest. Critical for the freshness path — a stalled embedder is the most common cause of
rag freshnessSLO misses.
The "same model on read and write" rule.
- The vectors in the index were produced by model M; a query must be embedded by model M to be comparable. Different models live in different vector spaces — cosine similarity between them is noise.
- The fix when you want to upgrade: do a full re-embed into a new collection (covered in section 5 under blue/green).
- The pre-flight check: store
embed_model_idin the metadata sidecar of every chunk, and assert it matches the query embedder at retrieval time. Cheap insurance against a silent bug.
Batch vs streaming embedding.
- Batch. A nightly job re-embeds all changed chunks. Fine for low-freshness use cases (knowledge base of reference docs). Cheaper per-token.
- Streaming. CDC → Kafka → embed worker → vector upsert. Sub-minute lag. Required for high-freshness use cases (ticket triage, live policy lookup).
- Hybrid. Batch for bulk re-embeds; streaming for incremental change. Most production systems converge to this shape.
Metadata schema — the columns every chunk carries.
-
tenant_id— multi-tenant filter pushdown. Required. -
source— origin URI for attribution and trust signals. -
document_id— group children back to parent doc. -
last_modified— for freshness SLO and recency filters. -
acl_ids— list of permission tags intersected with the user's permissions. -
content_type—prose,code,table,transcript— drives content-specific behaviour. -
embed_model_id— embedding model version. The blue/green safety pin. -
content_hash— SHA-256 of chunk text. Powers skip-unchanged and dedupe.
Hybrid retrieval — dense + BM25.
- Dense alone. Great on semantic synonyms ("revenue" matches "income"). Misses on rare keywords ("RT-2718 error code") because rare tokens have weak embeddings.
- BM25 alone. Great on rare keywords. Misses on synonyms. Sensitive to vocabulary mismatch.
-
Hybrid (RRF or weighted). Recovers both regimes. The 2026 default for production RAG. The typical weighted formula:
0.6 * dense_score + 0.4 * bm25_scoreafter both are min-max normalised; the typical rank-based formula is Reciprocal Rank Fusion withk=60. - When weighted vs RRF. RRF when scores from the two systems are not directly comparable (different score scales); weighted when you have tuned weights from a held-out set.
Reranking — the precision multiplier.
- Bi-encoder (ANN). The embedding model is a bi-encoder: query and doc are embedded separately and compared by cosine. Fast (millions of vectors per second) but loses precision because the comparison is just a dot product.
-
Cross-encoder (reranker). A second model that takes
(query, doc_text)as a pair and outputs a relevance score. Far more expressive (full attention across query + doc) but slow. - The standard shape. ANN top-N → reranker → top-k. N is typically 50; k is typically 5. The reranker only fires on 50 candidates per query, so the wall-clock cost is bounded.
-
Models. Cohere
rerank-v3, BGE-reranker, mxbai-rerank. Hosted models add ~50-100ms; self-hosted are 30-200ms depending on GPU and batch size.
Worked example — picking embedding dimensions: 256 vs 768 vs 1536
Detailed explanation. Higher-dimension embeddings can express more nuance but pay for it in storage (4 bytes per dim × N chunks), index size, and query latency. Most teams over-pay for dimensions; a calibrated dim choice is one of the highest-leverage decisions in a RAG project.
Question. A team has 10M chunks. Compare storage and query cost across 256, 768, and 1536 dimensions. Show the ANN recall trade-off.
Input. 10M chunks, all float32 vectors.
Code.
def storage_estimate(n_chunks: int, dims: int) -> dict:
bytes_per_chunk = dims * 4 # float32
raw_bytes = n_chunks * bytes_per_chunk
return {
"dims": dims,
"bytes_per_chunk": bytes_per_chunk,
"raw_storage_gb": raw_bytes / 1e9,
"approx_search_relative_cost": dims,
}
for dims in (256, 384, 768, 1024, 1536):
print(storage_estimate(10_000_000, dims))
Step-by-step explanation.
- 10M chunks × 1536 dims × 4 bytes = 61.4 GB raw vectors. At 768 dims → 30.7 GB. At 256 dims → 10.2 GB. Six-fold storage difference between the extremes.
- ANN search cost is roughly proportional to dims for the distance computation (HNSW edge comparisons scale linearly). A 1536-dim query is ~6x more compute per node visited than a 256-dim query.
- Recall trade-off (from MTEB benchmarks): going from 1536 → 768 typically costs 0-3 points of recall@10. Going to 384 costs 3-6 points. Going to 256 costs 5-10 points but pays back 6x on storage and search.
- Matryoshka embeddings (variable-dim truncation, where lower-dim prefixes are themselves usable embeddings) let you store at 1536 and search at 384 — a popular 2026 pattern.
Output.
| dims | raw storage (10M) | rel. search cost | typical recall lost |
|---|---|---|---|
| 256 | 10.2 GB | 1.0× | 5-10 pts |
| 384 | 15.4 GB | 1.5× | 3-6 pts |
| 768 | 30.7 GB | 3.0× | 0-3 pts |
| 1024 | 40.9 GB | 4.0× | 0-2 pts |
| 1536 | 61.4 GB | 6.0× | reference |
Rule of thumb. Default to 384-768 dims for production RAG; reserve 1536 for cases where the golden-set recall difference justifies the 6x storage and search cost. Matryoshka embeddings (store 1536, search 384) are the best of both worlds when the embedding model supports them.
Worked example — hybrid retrieval with Reciprocal Rank Fusion
Detailed explanation. Dense and sparse retrieval miss on orthogonal vocabulary regimes. Reciprocal Rank Fusion combines them by rank rather than score — no need to normalise scales, no per-system weight tuning. The 2026 default for hybrid search.
Question. Implement RRF that fuses a dense ANN result list and a BM25 result list into a single ranked output. Compare with a naive weighted-score fusion on a small example.
Input. Dense and sparse top-5 for query "refund window digital".
| rank | dense_id | dense_score | sparse_id | sparse_score |
|---|---|---|---|---|
| 1 | A | 0.92 | C | 18.4 |
| 2 | B | 0.88 | A | 17.1 |
| 3 | C | 0.71 | D | 14.0 |
| 4 | D | 0.65 | B | 12.5 |
| 5 | E | 0.62 | F | 9.8 |
Code.
def rrf(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
scores: dict[str, float] = {}
for rank, doc_id in enumerate(dense):
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
for rank, doc_id in enumerate(sparse):
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
return sorted(scores, key=lambda d: -scores[d])
def weighted(dense, sparse, w_dense=0.6, w_sparse=0.4):
# Both scores must be min-max normalised first
def norm(items, key):
vals = [v for _, v in items]
lo, hi = min(vals), max(vals)
return {k_: (v - lo) / (hi - lo + 1e-9) for k_, v in items}
d = norm(dense, "dense")
s = norm(sparse, "sparse")
ids = set(d) | set(s)
return sorted(ids, key=lambda i: -(w_dense * d.get(i, 0) + w_sparse * s.get(i, 0)))
dense_ids = ["A", "B", "C", "D", "E"]
sparse_ids = ["C", "A", "D", "B", "F"]
print("RRF:", rrf(dense_ids, sparse_ids, k=60))
Step-by-step explanation.
- RRF assigns each candidate a score of
1/(k + rank)for each list it appears in. Lower ranks get more weight. The score for a doc that appears in both lists is the sum of the two contributions. - For doc A: rank 1 in dense (score
1/61 = 0.0164), rank 2 in sparse (score1/62 = 0.0161). Total = 0.0325. - For doc C: rank 3 in dense (
1/63 = 0.0159), rank 1 in sparse (1/61 = 0.0164). Total = 0.0323. - A wins narrowly over C because it ranked higher in dense; both win over docs that appear in only one list because they get only one contribution.
- RRF needs no score normalisation. Weighted score fusion requires normalising the dense cosine similarities and the BM25 scores to a common range and tuning the weights on a held-out set. RRF is robust to both.
-
k=60is the value from the original RRF paper. Smaller k weights the top of each list more aggressively; larger k flattens the contribution. The default works well in practice.
Output (RRF top-5).
| rank | doc | rrf_score |
|---|---|---|
| 1 | A | 0.0325 |
| 2 | C | 0.0323 |
| 3 | B | 0.0318 |
| 4 | D | 0.0314 |
| 5 | E | 0.0164 |
Rule of thumb. Default to RRF for hybrid retrieval — it is robust, requires no tuning, and works whenever you can produce two ranked lists. Switch to weighted score fusion only if you have a held-out tuning set and the RRF result is leaving recall on the table.
Worked example — metadata pushdown vs post-filter
Detailed explanation. Multi-tenant RAG must restrict every retrieval to the calling tenant. Where the filter is applied matters enormously — pushdown into the ANN index is cheap; post-filter after ANN is catastrophic at scale.
Question. Show two implementations of a tenant-aware retrieve: one with metadata pushdown into the vector store, one with a post-filter on the application side. Quantify the cost difference.
Input. A 50M-chunk index where each tenant has ~50k chunks (1000 tenants).
Code.
# BROKEN — post-filter on the application side
def retrieve_postfilter(query: str, tenant_id: str, k: int = 5) -> list:
qvec = embed_model.encode([query])[0]
candidates = vector_store.search(vector=qvec, top_k=2000) # huge over-fetch
return [c for c in candidates if c.metadata["tenant_id"] == tenant_id][:k]
# CORRECT — push the filter down into the ANN search
def retrieve_pushdown(query: str, tenant_id: str, k: int = 5) -> list:
qvec = embed_model.encode([query])[0]
return vector_store.search(
vector=qvec,
top_k=k,
filter={"tenant_id": tenant_id},
)
Step-by-step explanation.
- With 50M chunks and 1000 tenants, a tenant's chunks are 0.1% of the index. To get a top-5 after filtering, the post-filter approach must over-fetch by a huge factor — empirically 2000-10000 candidates to reliably surface 5 from the target tenant.
- ANN search cost is O(log N) per query, but at large over-fetch the constants matter — fetching 10000 vs 5 candidates is roughly 100-1000x the wall-clock at scale.
- Pushdown lets the ANN index restrict the search graph traversal to nodes that match the filter — modern vector stores (Pinecone, Qdrant, Weaviate, pgvector with
vchord/pgvecto.rs) support this natively. - Worst case for post-filter: the user has a "needle in haystack" tenant where the top 2000 ANN candidates contain zero of the tenant's chunks → user gets empty results despite having matching chunks in the index.
- Pushdown is also the only safe form — post-filter is a defence-in-depth layer at best, not the primary tenant boundary.
Output.
| Approach | Avg fetched | P95 latency | tenant safety |
|---|---|---|---|
| post-filter | 2000-10000 | 400-2000 ms | weak |
| pushdown | k (5) | 15-50 ms | strong |
Rule of thumb. Always push tenant and ACL filters down into the vector store; never rely on application-side post-filter. The performance difference is 1-2 orders of magnitude, and the safety boundary lives in exactly one place — the vector store query itself.
Worked example — cross-encoder reranker on top of hybrid
Detailed explanation. A cross-encoder reranker is the standard precision multiplier on top of hybrid retrieval. The shape is fixed: hybrid produces 50 candidates → reranker scores all 50 in one batch → top-5 are the prompt context. Adds ~50-100ms; lifts top-5 precision dramatically.
Question. Add a Cohere-style reranker to the hybrid pipeline. Show the batch call shape, the latency budget, and what happens when the reranker times out.
Input. A query plus 50 hybrid-fused candidate chunks.
Code.
def hybrid_then_rerank(query: str, tenant_id: str, k: int = 5) -> list:
# 1) Dense + sparse → fuse → 50 candidates
candidates = hybrid_search(query, tenant_id, top_k=50)
texts = [text_store.get(c.id) for c in candidates]
# 2) Cross-encoder batch — single call with all 50 pairs
try:
scores = reranker.score_batch(
query=query, documents=texts, timeout_ms=120
)
except TimeoutError:
# 3) Graceful degradation — return top-k from hybrid without rerank
log.warn("reranker_timeout", query_hash=hash(query))
metric("rag.rerank_timeout", 1)
return candidates[:k]
# 4) Sort by reranker score; return top-k
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
return [c for c, _ in ranked[:k]]
Step-by-step explanation.
- The reranker call is batched — all 50
(query, doc)pairs go in one HTTP request. Per-pair calls would multiply latency by ~50. - Latency budget: 120ms timeout. Cohere rerank-v3 typically lands at 60-100ms for 50 pairs; budget the timeout above the P99 to avoid spurious timeouts.
- Graceful degradation: if the reranker times out, fall back to the hybrid top-k. Never fail the whole query — degrade to "less precise but still answered."
- Metric
rag.rerank_timeoutis graphed and alerted on. A sustained timeout rate (≥5%) is the first sign the reranker is overloaded or the model API is unhealthy. - The reranker is the only online step that can be hot-swapped. Promote a new reranker behind a feature flag, run it shadow-style for a week against the golden set, then flip.
Output.
| step | latency budget | typical actual |
|---|---|---|
| query embed | 30 ms | 10-25 ms |
| ANN search (filtered) | 40 ms | 15-30 ms |
| BM25 search | 30 ms | 10-20 ms |
| RRF fuse | 5 ms | <2 ms |
| reranker batch (50) | 120 ms | 60-100 ms |
| prompt assemble | 20 ms | 5-15 ms |
| total | < 300 ms | 100-200 ms |
Rule of thumb. Budget the reranker on its own line. Cap candidate count at 50 (sometimes 100 for very high-stakes queries). Always wrap the call with a timeout and a graceful-degradation fallback to "hybrid without rerank" — never fail the user-facing query because the reranker hiccupped.
Data engineering interview question on storage schema and hybrid retrieval
A senior interviewer might frame this as: "You are designing the vector store schema for a multi-tenant retrieval augmented generation service. Walk me through the schema, why each column exists, and how the online retrieval query uses every column."
Solution Using a metadata-rich schema and pushdown filters
# Vector store row schema — every column has a job
SCHEMA = {
"id": "text PRIMARY KEY", # deterministic chunk_id
"vector": "vector(768)", # embedding
"tenant_id": "text NOT NULL", # multi-tenant pushdown
"source": "text NOT NULL", # attribution
"document_id": "text NOT NULL", # parent doc grouping
"chunk_idx": "int NOT NULL", # sibling lookup
"content_type": "text", # prose / code / table / transcript
"last_modified": "timestamptz", # freshness SLO + recency filter
"acl_ids": "text[]", # permission tags
"embed_model_id": "text", # blue/green safety pin
"content_hash": "text", # skip-unchanged + dedupe
}
# The retrieval query — every metadata column is exercised
def retrieve(query, tenant_id, user_acl, max_age_days):
qvec = embed_model.encode([query])[0]
return vector_store.search(
vector=qvec,
top_k=50,
filter={
"tenant_id": tenant_id,
"acl_ids": {"$overlap": user_acl},
"last_modified": {"$gte": days_ago(max_age_days)},
"embed_model_id": embed_model.id,
},
)
Step-by-step trace.
| Column | Used by |
|---|---|
| id | upsert idempotency, sidecar text lookup |
| vector | ANN cosine similarity |
| tenant_id | filter pushdown — multi-tenant safety |
| source | prompt attribution, trust signal |
| document_id | parent-chunk resolution, "show full doc" |
| chunk_idx | sibling lookup ("read the next chunk too") |
| content_type | content-aware rerank or rules |
| last_modified | freshness SLO, recency filter |
| acl_ids | permission pushdown (overlap match) |
| embed_model_id | blue/green safety — match read model |
| content_hash | skip-unchanged on reingest, dedupe |
The trace highlights that every metadata column earns its keep — none are decorative. Drop tenant_id and you have a cross-tenant leakage bug; drop embed_model_id and you have a silent model-mismatch bug; drop acl_ids and you have an authz failure.
Output:
| Decision | Rationale |
|---|---|
| 768 dims | balance recall vs storage |
| pgvector or Qdrant | pushdown filter support |
| ACL as array | overlap-match against user's permissions |
content_hash indexed |
re-ingest skips unchanged chunks |
last_modified indexed |
freshness queries are common |
Why this works — concept by concept:
- Metadata pushdown is the single biggest perf lever — restricting the ANN search to the tenant's slice changes the constants by 1-2 orders of magnitude at scale.
- embed_model_id is the silent-bug pin — the most subtle RAG failure mode is read and write paths using different embedding models; the metadata column makes the mismatch detectable in seconds.
- content_hash is the skip-unchanged pin — letting the ingest job re-embed only changed chunks turns nightly reindex from a multi-hour batch into a sub-minute incremental.
- ACL overlap match — modern vector stores support array-overlap as a filter primitive. Permission pushdown then runs at the same speed as tenant pushdown.
-
Cost — schema overhead is ~200 bytes per chunk for metadata; the index on
tenant_idandlast_modifiedis the difference between a 200ms and a 20ms query at 50M chunks. Every byte earns its keep.
DE
Topic — indexing
Indexing problems (DE)
5. Freshness, reindex, and ACLs
rag freshness is an SLO, not a feature — and the reindex playbook is what keeps it honest
The mental model in one line: freshness is the P95 lag between a source document being updated and the corresponding chunk being retrievable — and a production RAG pipeline either ships an explicit freshness SLO with telemetry or silently drifts into "the bot still cites the old policy." Once you state the SLO, every reindex strategy maps to "how do I keep this lag under X."
The freshness SLO.
- Definition. P95 source-to-retrieval lag — the time between a source-of-truth update (Confluence save, Postgres commit, S3 put) and the moment the new chunk is retrievable in the vector store.
- Typical thresholds. Knowledge base reference docs: 30-60 minutes. Live policy / pricing lookup: 1-5 minutes. Real-time support context: <30 seconds.
- Why SLO not feature. A feature is "we reindex nightly." An SLO is "we promise P95 < 5 minutes and we will page if it breaks." Only the SLO survives contact with stakeholders.
-
Telemetry.
now - max(last_modified) per sourcefor the embed worker; query-sidenow - retrieved_chunk.last_modified. Graph both; alert on either.
Incremental reindex via CDC.
- Source CDC. Postgres logical replication (Debezium), Confluence webhooks, Notion webhooks, S3 event notifications. The source emits "what changed" events into a topic (Kafka, Kinesis).
-
Embed worker. Consumes the topic, fetches the changed document, re-chunks, re-embeds only changed chunks (skip-unchanged via
content_hash), upserts vectors and metadata. -
Upsert semantics. Deterministic chunk_ids mean
upsertis idempotent — the same change replayed produces the same final state. - Backfill mode. A new source connector starts in "full backfill" (read every doc once) and graduates to "CDC tail" (read changes only).
Tombstoning deletes.
- The problem. A document deleted in the source must vanish from retrieval — but most teams only mark the source as deleted and forget the vector store.
-
The fix. When CDC emits a "delete" event, the embed worker either hard-deletes the chunk_ids for that document, or soft-deletes them by setting
is_active=falsein the metadata sidecar (and filtersis_active=trueon every retrieve). - Why soft-delete + nightly purge. Soft-delete is reversible (whoops, didn't mean to nuke that doc); nightly purge sweeps hard-deletes after a grace window.
- Test. Every nightly eval golden set includes a "deleted doc must not appear" canary. If it appears, page.
Embedding model upgrade = full re-embed.
- The rule. The vectors in the index were produced by model M. Switching to model M' invalidates every vector; queries embedded by M' do not match vectors from M.
-
The pattern. Blue/green collections. Create
collection_v2with the new model, re-embed everything, dual-write during transition, cut over reads atomically, then dropcollection_v1. - Cost. A full re-embed of 100M chunks at $0.02/M tokens × 500 tokens/chunk ≈ $1000 in API spend, plus the worker compute. Budget for it before promising the upgrade.
-
Validation. Run the golden-set eval against
v2before cutover. Ifrecall@5is not meaningfully better, the upgrade is not worth it.
Per-tenant ACL pushdown.
-
The shape. Each chunk carries
acl_ids: ["public", "support", "engineering"]. Each user hasuser_acl: ["public", "support"]. The retrieval filter isacl_ids OVERLAP user_aclpushed down into the vector store. -
Why pushdown not post-filter. Same reason as
tenant_id— post-filter forces huge over-fetch; pushdown is one extra predicate in the ANN traversal. - Sourcing the ACL list. Materialised at ingest time from the source's ACL system (Confluence space permissions, Notion page permissions, S3 bucket policy tags). Refreshed via CDC when permissions change.
-
Permission change is itself a CDC event. A user joins a new team → their
user_aclupdates → next query sees the wider chunk set. No reindex needed.
Quality regression detection.
- Golden set drift. Recall@5 / MRR@10 trended over time. A sudden drop after an ingest job means a recent change broke retrieval.
- Top-cause attribution. Diff the recent ingest config (chunker change, embedder change, schema change) against the last "green" deploy.
- Auto-rollback. Some teams configure auto-rollback on a sustained ≥5pt recall drop — return the index to its previous state until a human investigates.
Common interview probes on freshness and reindex.
- "What is your freshness SLO and how do you measure it?" — name a P95 number, name a telemetry source, name the alert.
- "How do you handle deletes?" — soft-delete in metadata + filter on retrieve + nightly hard-purge sweep.
- "How do you upgrade an embedding model in production?" — blue/green collections, dual-write, golden-set validation, atomic cutover.
- "How do you enforce per-user permissions?" —
acl_idsarray column on every chunk; overlap filter pushed down into the vector store; permission changes flow through the same CDC stream.
Worked example — CDC-driven incremental reindex worker
Detailed explanation. A streaming embed worker consumes a CDC topic, fetches the changed document, re-chunks, re-embeds only changed chunks (skip-unchanged via content_hash), and upserts. Sub-minute lag at steady state.
Question. Sketch the embed worker that consumes a source.changes Kafka topic and upserts vectors with sub-5-minute P95 lag.
Input. A stream of CDC events.
| event | doc_id | op | last_modified |
|---|---|---|---|
| 1 | PAGE-42 | upsert | 2026-06-15T10:00:00Z |
| 2 | PAGE-09 | delete | 2026-06-15T10:00:05Z |
| 3 | PAGE-42 | upsert | 2026-06-15T10:00:30Z |
Code.
def embed_worker_loop():
consumer = kafka.Consumer("source.changes", group_id="rag-embed-worker")
for event in consumer:
try:
if event["op"] == "delete":
handle_delete(event["doc_id"])
else:
handle_upsert(event["doc_id"])
consumer.commit(event)
metric("rag.source_lag_seconds", now() - event["source_ts"])
except Exception as e:
log.error("embed_worker_error", error=str(e), event=event)
consumer.send_to_dlq(event)
def handle_upsert(doc_id: str) -> None:
doc = source.fetch(doc_id) # idempotent fetch by id
new_chunks = chunk_dispatch(doc) # per-content-type chunker
existing = vector_store.scan(filter={"document_id": doc_id})
existing_hashes = {c.metadata["content_hash"] for c in existing}
to_upsert = []
for new in new_chunks:
h = hashlib.sha256(new["text"].encode()).hexdigest()
if h in existing_hashes:
continue # skip unchanged
new["metadata"]["content_hash"] = h
to_upsert.append(new)
if to_upsert:
vectors = embed_model.encode([c["text"] for c in to_upsert])
for c, v in zip(to_upsert, vectors):
c["vector"] = v
vector_store.upsert(to_upsert)
def handle_delete(doc_id: str) -> None:
# Soft-delete — mark as inactive; nightly purge sweeps hard deletes
vector_store.update_metadata(
filter={"document_id": doc_id},
patch={"is_active": False, "deleted_at": now()},
)
Step-by-step explanation.
- The worker consumes one event at a time, processes it, and commits the offset only after the upsert succeeds. At-least-once delivery semantics are fine because chunk_ids are deterministic and
upsertis idempotent. - On an upsert event, the worker fetches the full document from the source (CDC events typically carry only IDs, not bodies), re-chunks, and computes the content hash of each chunk.
- The skip-unchanged optimisation: chunks whose hash matches a previously stored one are not re-embedded. For a single-paragraph edit on a 50-paragraph doc, only 1-2 chunks change — the savings compound.
- On a delete event, the worker soft-deletes by setting
is_active=falseon every chunk for that doc. The retrieval path filtersis_active=true, so deleted chunks are invisible immediately. - A nightly purge job hard-deletes rows where
deleted_at < now() - 7 days, freeing index space. -
metric("rag.source_lag_seconds", now() - event.source_ts)is the per-event freshness measurement. Aggregated as P95 across an hour, this is the freshness SLO. - Errors go to a DLQ (dead-letter queue) so a single bad doc cannot block the whole stream. The DLQ has its own alert.
Output.
| event | action | chunks re-embedded | lag |
|---|---|---|---|
| 1 (PAGE-42 upsert) | fetch + chunk + embed | 3 of 12 changed | ~45 s |
| 2 (PAGE-09 delete) | soft-delete | 0 (metadata only) | ~5 s |
| 3 (PAGE-42 upsert) | fetch + chunk + embed | 1 of 12 changed | ~30 s |
Rule of thumb. The skip-unchanged + soft-delete + DLQ trio is the production shape of a CDC embed worker. Without them you re-embed everything on every change (cost explodes), you hard-delete on every event (no recovery), or you crash on the first malformed doc (whole stream stalls).
Worked example — blue/green embedding model upgrade
Detailed explanation. A model upgrade invalidates every vector in the index. The safe pattern is blue/green: build a parallel v2 collection with the new model, dual-write incoming CDC events into both, validate against the golden set, cut over reads atomically, drop the old collection.
Question. Sketch the blue/green upgrade flow from embedding model e5-large-v1 to e5-large-v2 for a 50M-chunk index.
Input. Existing collection_v1 with all chunks embedded by e5-large-v1.
Code.
# Phase 1 — Create v2 collection and full re-embed (background)
def backfill_v2():
create_collection("collection_v2", dims=v2_model.dims)
for batch in scan_chunks("collection_v1", batch_size=1000):
texts = [text_store.get(c.id) for c in batch]
vectors = v2_model.encode(texts)
rows = []
for c, v in zip(batch, vectors):
rows.append({
**c.as_dict(),
"vector": v,
"metadata": {**c.metadata, "embed_model_id": "e5-large-v2"},
})
upsert("collection_v2", rows)
# Phase 2 — Dual-write incoming CDC into both collections
def handle_upsert_dualwrite(doc_id: str):
doc = source.fetch(doc_id)
new_chunks = chunk_dispatch(doc)
v1_vecs = v1_model.encode([c["text"] for c in new_chunks])
v2_vecs = v2_model.encode([c["text"] for c in new_chunks])
for c, v1, v2 in zip(new_chunks, v1_vecs, v2_vecs):
upsert("collection_v1", [{**c, "vector": v1}])
upsert("collection_v2", [{**c, "vector": v2}])
# Phase 3 — Validate v2 against the golden set
def validate_v2() -> bool:
v1_metrics = score_pipeline(golden, collection="collection_v1")
v2_metrics = score_pipeline(golden, collection="collection_v2")
return v2_metrics["recall_at_k"] >= v1_metrics["recall_at_k"] - 0.005
# Phase 4 — Atomic read cutover via feature flag
def retrieve(query, tenant_id):
collection = "collection_v2" if FLAGS["read_v2"] else "collection_v1"
model = v2_model if FLAGS["read_v2"] else v1_model
qvec = model.encode([query])[0]
return search(collection, qvec, filter={"tenant_id": tenant_id})
# Phase 5 — Drop v1 after a 7-day soak
def cleanup_v1():
if FLAGS["read_v2"] and days_since_cutover() >= 7:
drop_collection("collection_v1")
Step-by-step explanation.
- Phase 1 (backfill): a background job iterates every chunk in
v1, re-embeds with the new model, writes intov2. Heavy compute and API spend; runs over hours or days for a large index. - Phase 2 (dual-write): from the moment the backfill starts, every CDC event is written into both collections. This keeps
v2in sync with new edits while the backfill is still running. - Phase 3 (validate): run the golden-set eval against both collections.
v2must match or beatv1on recall@5 / MRR@10. A regression is a stop sign. - Phase 4 (cutover): flip a feature flag. The read path switches from
v1tov2and fromv1_modeltov2_model. The flip is instant and reversible. - Phase 5 (cleanup): after a 7-day soak (during which you can flip back if something breaks), drop
v1and reclaim storage. - Throughout: dual-write costs 2x the embed compute. Plan the rollout so the cost window is bounded — typically a few days, not weeks.
Output.
| phase | duration | risk |
|---|---|---|
| backfill | hours-days | high cost, no user impact |
| dual-write | duration of rollout | 2x embed cost |
| validate | 1-3 days | catches regressions |
| cutover | instant | reversible via flag |
| cleanup | after 7d soak | irreversible |
Rule of thumb. Never hot-swap an embedding model. Always go blue/green: backfill → dual-write → validate → atomic cutover → soak → drop old. The 2x embed cost during rollout is the price you pay for zero downtime and instant rollback.
Worked example — per-tenant ACL pushdown end-to-end
Detailed explanation. Multi-tenant RAG with per-document permissions means every chunk carries an acl_ids array and every query carries the user's user_acl. The retrieval predicate is acl_ids OVERLAP user_acl, pushed down into the vector store.
Question. Show the ingest-side ACL materialisation and the retrieval-side overlap filter for a Confluence-backed RAG with space-level permissions.
Input. A Confluence page with space ACLs.
| field | value |
|---|---|
| page_id | "PAGE-42" |
| space_id | "ENG" |
| space_acl | ["eng-team", "leadership"] |
Code.
# Ingest — materialise ACL on every chunk
def ingest_with_acl(page):
acl_ids = space_acl_resolver.resolve(page["space_id"]) # ["eng-team", "leadership"]
chunks = chunk_dispatch(page)
for c in chunks:
c["metadata"]["acl_ids"] = acl_ids
c["metadata"]["space_id"] = page["space_id"]
vectors = embed_model.encode([c["text"] for c in chunks])
for c, v in zip(chunks, vectors):
c["vector"] = v
vector_store.upsert(chunks)
# Retrieve — overlap filter pushed down
def retrieve_for_user(query, user_id, tenant_id):
user_acl = user_directory.get_acl(user_id) # e.g. ["eng-team", "all-hands"]
qvec = embed_model.encode([query])[0]
return vector_store.search(
vector=qvec,
top_k=50,
filter={
"tenant_id": tenant_id,
"acl_ids": {"$overlap": user_acl}, # array overlap
"is_active": True,
},
)
# Permission change — same CDC stream
def on_user_acl_change(user_id):
# No reindex needed — next query picks up the new ACL
user_directory.invalidate(user_id)
def on_space_acl_change(space_id):
# Re-materialise ACL on every chunk in this space
new_acl = space_acl_resolver.resolve(space_id)
vector_store.update_metadata(
filter={"space_id": space_id},
patch={"acl_ids": new_acl},
)
Step-by-step explanation.
- At ingest, the chunker gets the space's ACL list and writes it into every chunk's metadata. The space ID is also stored so the chunks can be updated atomically if the space ACL changes.
- The retrieval query pushes
acl_ids OVERLAP user_acldown into the vector store. The user sees only chunks whose ACL intersects their permission set. - User permission change (joins a new team) does not require reindex — the user's
user_aclupdates in the directory, next query reflects it. - Space permission change (a previously-private space becomes shared with another team) does require updating every chunk in that space — but the update is a metadata patch, not a re-embed. Sub-second on most vector stores.
- The overlap filter is one extra predicate in the ANN search, so the perf cost is negligible compared to the tenant filter.
- Combined with
tenant_idandis_active, the retrieval contract is "tenant-isolated, ACL-filtered, soft-delete-aware." Three predicates, all pushed down, all enforced in the vector store itself.
Output.
| user | user_acl | accessible chunks |
|---|---|---|
| alice (eng) | ["eng-team", "all-hands"] | PAGE-42 (eng-team) + public |
| bob (sales) | ["sales", "all-hands"] | not PAGE-42 |
| ceo | ["leadership", "all-hands"] | PAGE-42 (leadership) + public |
Rule of thumb. ACL pushdown belongs in the same query as the tenant filter — both filters live or die together in the vector store. Application-side post-filter for permissions is an authz time bomb; never ship it as the primary boundary.
Data engineering interview question on freshness and tombstoning
A senior interviewer might frame this as: "Stakeholder reports the bot is still citing a deleted policy doc. How do you design the pipeline so this cannot happen, and what is the SLO you commit to?"
Solution Using soft-delete + freshness SLO + golden-set canary
# 1) Soft-delete on CDC delete events
def handle_delete(doc_id):
vector_store.update_metadata(
filter={"document_id": doc_id},
patch={"is_active": False, "deleted_at": now()},
)
# 2) Retrieve filters out inactive chunks
def retrieve(query, tenant_id):
return vector_store.search(
vector=embed_model.encode([query])[0],
top_k=50,
filter={"tenant_id": tenant_id, "is_active": True},
)
# 3) Nightly purge sweeps hard deletes
def nightly_purge():
vector_store.delete(filter={
"is_active": False,
"deleted_at": {"$lt": days_ago(7)},
})
# 4) Golden-set canary catches regressions
DELETED_DOC_CANARY = {
"question": "What was the old refund window?",
"must_not_match_source": "confluence/PAGE-deleted-99",
}
def canary_check():
hits = retrieve(DELETED_DOC_CANARY["question"], tenant_id="default")
leaked = any(h.metadata["source"] == DELETED_DOC_CANARY["must_not_match_source"]
for h in hits)
if leaked:
page_oncall("deleted_doc_leak", source=DELETED_DOC_CANARY)
Step-by-step trace.
| Step | What it does | SLO |
|---|---|---|
| 1 soft-delete | metadata is_active=false on CDC delete |
P95 < 30 s |
| 2 retrieve filter | only is_active=true chunks returned |
always |
| 3 nightly purge | hard-delete after 7-day soak | nightly |
| 4 canary | golden-set probe for known-deleted docs | every nightly eval |
The trace highlights that deletion safety is the conjunction of four steps, not any one of them — the soft-delete handles the instant invisibility, the retrieve filter enforces it, the nightly purge reclaims storage, and the canary catches the day someone breaks step 2.
Output:
| Failure mode | Catch step |
|---|---|
| CDC delete event missed | step 4 (canary in next eval) |
| Soft-delete not yet propagated | step 1 (30s SLO) |
| Retrieve filter dropped | step 4 (canary fires) |
| Hard purge premature | 7-day soak window |
Why this works — concept by concept:
- Soft-delete first, hard-delete later — gives a reversible window. Whoops-deletes recover in seconds; intentional deletes flush after 7 days.
-
Retrieve-side filter on
is_active— the safety boundary lives in the query, not in the deletion handler. Even if the delete event is replayed or lost, the next query still respectsis_active. - Golden-set canary — a known-deleted doc in the golden set is a continuous integration test for deletion. If the bot ever cites it, the eval fails before the user sees it.
- Freshness SLO ties the loop — the P95 source-to-retrieval lag SLO covers both adds and deletes; one number, alerted on, owned by DE.
- Cost — soft-delete is one metadata write per deleted chunk; canary is one extra row in the golden set; purge is a nightly bulk delete. All bounded; none cost more than a few minutes a day.
DE
Topic — data validation
Data validation problems (DE)
Cheat sheet — RAG pipeline recipes
- Chunk size starter. 500 tokens with 75-token overlap (15%). Tune against the golden set before changing.
- Per-content-type strategy. Prose → recursive; code → AST; tables → atomic; transcripts → semantic; long-form policy → hierarchical (sentence child, paragraph parent).
-
Hybrid score fusion. Reciprocal Rank Fusion (RRF) with
k=60— robust, no tuning. Weighted0.6 * dense + 0.4 * BM25only if you have a held-out tuning set. -
Metadata filter on retrieve.
WHERE tenant_id = $1 AND acl_ids OVERLAP $2 AND is_active = true— pushed down into the vector store, never post-filter. - Reranker shape. Top-50 ANN candidates → cross-encoder batch → top-5 prompt context. Wrap with 120ms timeout and graceful degradation.
-
Embed model rule. Store
embed_model_idin every chunk's metadata; assert it matches the query embedder at retrieval time. Catches the silent model-mismatch bug. - Freshness pipeline. Source CDC → Kafka → embed worker → vector upsert. P95 source-to-retrieval lag SLO < 5 minutes (tune to use case).
- Skip-unchanged. Hash chunk text with SHA-256; store in metadata; re-embed only when hash changes. Turns nightly reindex into sub-minute incremental.
-
Tombstone delete. Soft-delete via
is_active=falsein metadata; filter on retrieve; nightly purge sweep hard-deletes after 7 days. - Embedding model upgrade. Blue/green collections — backfill → dual-write → golden-set validate → atomic cutover → 7-day soak → drop old. Never hot-swap.
-
ACL pushdown.
acl_idsarray on every chunk;acl_ids OVERLAP user_aclfilter pushed down. Permission changes update the user directory, not the chunks (except space-level changes). -
Golden set. 200-2000
(question, expected_chunk_id)triples maintained by SMEs. Nightly recall@5 and MRR@10 with alerts on a sustained 5pt drop. - Deletion canary. A known-deleted doc in the golden set whose source must not appear in retrieval. Continuous integration for deletion.
- DLQ on the embed worker. Errors do not block the stream; a malformed doc goes to a DLQ with its own alert.
- Telemetry. Source lag, embed queue depth, ANN P95, reranker P95, fallback ratio, recall@5. Six numbers, six graphs.
Frequently asked questions
What chunk size should I start with for a new RAG pipeline?
Start at 500 tokens with 75-token overlap (15%). It is the empirically calibrated default that lands inside most embedding model context windows (most are 512 tokens) and leaves enough room for the LLM to receive 5-10 chunks per prompt. Tune from there against your golden set — long-form policy docs often benefit from hierarchical chunking (120-token children, 600-token parents); transcripts benefit from semantic chunking with similarity threshold 0.55; code should be split by AST function/class boundaries with no overlap. The cheat-sheet recipe to memorise: "500/75 default, per-content-type adapters, golden-set tuning."
Do I need hybrid search or is dense retrieval enough?
You almost always need hybrid search in production. Pure dense retrieval misses on rare keywords (error codes, product SKUs, proper nouns the embedder has never seen), which is where 10-15 points of recall@5 hide. Pure BM25 misses on semantic synonyms ("revenue" vs "income"). Fusing them with Reciprocal Rank Fusion (RRF, k=60) recovers both regimes without per-system weight tuning. Skip hybrid only if (a) your content is all natural-language prose with no rare-vocab terms and (b) your golden-set recall is already at the threshold you need. In every other case, hybrid is the 2026 default for hybrid search.
How do I keep my RAG index fresh?
State a freshness SLO (e.g. P95 source-to-retrieval lag < 5 minutes), implement a CDC pipeline (Postgres logical replication, Confluence webhooks, Notion webhooks, S3 events) that emits change events into Kafka, and run an embed worker that consumes the stream, re-chunks the changed document, and upserts the new vectors with deterministic chunk_ids. Use content_hash to skip-unchanged chunks so a one-paragraph edit only re-embeds one chunk. For deletes, soft-delete via is_active=false in metadata and filter on retrieve; nightly purge sweeps hard-deletes after a 7-day soak. Graph now - max(last_modified) per source as the SLO telemetry; alert on breaches.
When do I need a reranker?
You need a reranker as soon as top-5 precision matters — which in practice is "as soon as your bot ships to real users." The standard shape is hybrid retrieves 50 candidates, the cross-encoder reranks all 50 in one batch, top-5 go into the prompt. Reranker adds 50-100ms; lifts recall@1 and MRR significantly because cross-encoders apply full attention across the query and the chunk text, not just a dot product. Cohere rerank-v3 and BGE-reranker are the common hosted/OSS picks. Always wrap the call with a timeout and a graceful-degradation fallback to "hybrid top-k without rerank" — never fail the user-facing query because the reranker hiccupped.
How do I enforce per-user permissions in RAG?
Store an acl_ids array on every chunk at ingest time (materialised from the source's permission system — Confluence space ACL, Notion page permissions, S3 bucket tags), and store each user's user_acl in a directory service. At retrieval time, push a acl_ids OVERLAP user_acl predicate down into the vector store alongside the tenant_id filter. Never apply ACL as an application-side post-filter — the perf cost is huge (massive over-fetch needed) and the safety story is weaker because the boundary lives in two places. User permission changes update the directory only (no reindex needed); space-level permission changes patch the acl_ids metadata on every chunk in that space (sub-second on most vector stores).
How do I evaluate RAG quality before shipping?
Build a golden set of 200-2000 (question, expected_chunk_id, expected_answer) triples with subject-matter experts. Run a nightly eval that fires every question through the live retrieval pipeline (same embed model, same filter, same rerank) and scores recall@5 (was the gold chunk in the top-5?) and MRR@10 (1/rank, averaged). Threshold recall@5 ≥ 0.85 as a typical production gate; alert on a sustained 5-point drop. Add a "deleted doc canary" to the golden set so deletion regressions surface. For online quality, sample live queries and score them with an LLM-as-judge against a rubric — useful for drift detection but not as ground truth. Without a golden set, every RAG change is a vibe-based decision.
Practice on PipeCode
- Drill the ETL pipeline practice library → for the ingest, chunk, and embed stages of a RAG pipeline.
- Rehearse on streaming problems → when the interviewer wants CDC-driven freshness pipelines.
- Sharpen data transformation drills → for the chunker and metadata sidecar shaping work.
- Layer the indexing library → for the metadata pushdown and ANN filter patterns.
- Stack the data validation library → for the golden-set eval harness and deletion canary patterns.
- Cover the real-time analytics library → for the freshness SLO and end-to-end latency patterns.
- For the broader DE surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the foundations with the ETL system design course →.
- For pipeline orchestration craft, work through Apache Spark internals for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every RAG pipeline recipe above ships with hands-on practice rooms where you write the CDC consumer, the chunker, the hybrid retrieval query, and the golden-set eval harness against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `rag data pipeline` will behave the same in production as it did on the whiteboard.





Top comments (0)