DEV Community

Cover image for RAG Data Pipelines: Chunking, Embeddings, Vector Stores & Freshness
Gowtham Potureddi
Gowtham Potureddi

Posted on

RAG Data Pipelines: Chunking, Embeddings, Vector Stores & Freshness

rag pipeline looks like a model problem to a newcomer — interviewers know it is really a four-stage data pipeline problem dressed up in transformer vocabulary. The model is the cheap part; the expensive, error-prone, on-call-rotation part is the ingest, chunk, embed, index, and refresh loop that decides what context the model ever sees. When a retrieval-augmented generation system gives wrong answers in production, the fix almost never lives in the prompt — it lives in the pipeline.

This guide is the cheat sheet you wished existed the first time a stakeholder asked "why does the bot still cite the old policy?" It walks the end-to-end architecture, the four families of chunking strategies (fixed, recursive, semantic, hierarchical), embedding model selection and the metadata sidecar that travels with every chunk, hybrid dense + BM25 retrieval with cross-encoder reranking, and the freshness SLO + reindex playbook that keeps a rag data pipeline from drifting. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for a RAG data pipeline tutorial — bold white headline 'RAG Data Pipelines' with subtitle 'chunking · embeddings · vector stores · freshness' and a stylised left-to-right four-stage flow with document icons becoming chunks becoming embedding orbs becoming a vector store, terminating in a glowing retrieval arrow on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on ETL pipeline problems →, and stack the data-modelling muscles with dimensional modeling drills →.


On this page


1. RAG as a data pipeline problem, not a prompting problem

The model is the cheap part — the pipeline is where 80% of rag pipeline quality issues live

The one-sentence invariant: a RAG system is only as good as the chunks the retriever can return, and the chunks are only as good as the ingest, normalize, split, and embed pipeline that produced them. Once you internalise that "retrieval is the bottleneck," every late-night "why did it hallucinate that fact?" ticket resolves into one of four pipeline questions — was the source ingested? was it chunked at a sensible boundary? was the embedding model right for the content? was the index fresh?

The four stages every rag data pipeline shares.

  • Ingest. Pull documents from the source — Confluence, S3, Postgres, Notion, support tickets, code repos — into a staging zone. Strip boilerplate, expand attachments, dedupe. The output is a clean text corpus tagged with source metadata.
  • Chunk + embed. Split each document into retrieval-sized units (paragraphs, sections, sliding windows) and feed each chunk through an embedding model to produce a dense vector. Persist the chunk text, the vector, and a metadata sidecar.
  • Index. Push every vector into a vector store (Pinecone, Weaviate, pgvector, Qdrant, Milvus) and keep the chunk text in a sidecar store (Postgres, S3) so the retrieval path can hand the model the text, not just the vector ID.
  • Retrieve + rerank. At query time, embed the user question with the same model, do an approximate-nearest-neighbour search, fuse with a keyword score (BM25), rerank the top-N with a cross-encoder, and assemble the top-k chunks into the prompt.

Three places quality silently dies.

  • At the chunk boundary. A fixed-size chunker that splits mid-sentence loses the conceptual unit. The retriever returns the right vicinity but the model gets half a sentence — and confidently completes it the wrong way.
  • At the metadata boundary. No tenant_id on a chunk means the multi-tenant retrieval filter has nothing to filter on, and tenant A starts seeing tenant B's documents in the answer. This is the most expensive bug in retrieval augmented generation shipping today.
  • At the freshness boundary. A nightly batch reindex means "the new policy" is invisible to retrieval until 3am tomorrow. By then the stakeholder has already gone to Slack and screenshotted the wrong answer.

Where data engineers own the work.

  • DE owns ingest, chunking, embedding orchestration, vector store schema, metadata sidecar, freshness SLO, ACL pushdown, eval harness ingestion. Everything between source-of-truth and the moment a chunk arrives in the prompt.
  • ML / applied scientists own the embedding model choice, the reranker model choice, prompt template tuning, and the eval scoring rubric. Everything that decides "given a chunk, how is it scored / consumed."
  • Boundary. The eval harness is shared: DE supplies the golden Q&A inputs and the evaluation runs against the deployed pipeline; ML defines the metrics (recall@k, MRR, faithfulness).

The 2026 reality.

  • Vector stores are commoditising fast. pgvector inside Postgres handles low-tens-of-millions of vectors with reasonable latency; Pinecone, Qdrant, Weaviate, and Milvus take over above that. The choice is now a TCO decision, not a feature one.
  • Hybrid search is the default. Pure dense retrieval lost ~5-15 points of recall on out-of-distribution keywords vs hybrid; teams now ship dense + BM25 score fusion (RRF or weighted) as table stakes.
  • Reranking is mandatory above ~10 users. Cross-encoders are 50-100x slower than ANN but lift top-k precision dramatically. The standard shape is top-50 ANN → reranker → top-5 prompt.
  • Freshness is a contract. The mature pattern is CDC-from-source → embed worker → vector upsert, with a P95 source-to-retrieval lag SLO measured in minutes, not hours.

Worked example — the four ways RAG silently returns the wrong answer

Detailed explanation. Most "the bot lies" tickets fall into four pipeline buckets, and the fastest way to fix them is to know which bucket before you touch the prompt template. The four are: chunk boundary, missing metadata filter, stale index, embedding-model mismatch. Each has a code-shaped fix.

Question. A support bot keeps citing the wrong refund policy for tenant acme. The right policy is in Confluence, but the bot pulls the old one. List the four failure modes that could explain this and show the smallest pipeline-side test for each.

Input. A retrieved-context audit log row per query.

query_id tenant_id retrieved_chunk_id retrieved_chunk_source retrieved_chunk_text last_modified
q1 acme c123 confluence/acme-refunds-v1 "Refunds are 14 days..." 2025-08-01

Code.

# Four pipeline-side tests — run each before touching the prompt

# 1) Chunk boundary test: does the right chunk even exist?
chunks = vector_store.search(query="acme refund policy", filter={"tenant_id": "acme"})
for c in chunks[:5]:
    print(c.source, c.text[:120])
# expected: at least one chunk from confluence/acme-refunds-v2

# 2) Metadata filter test: is tenant_id present on the new doc's chunks?
new_chunks = vector_store.scan(filter={"source": "confluence/acme-refunds-v2"})
assert all(c.metadata.get("tenant_id") == "acme" for c in new_chunks)

# 3) Freshness test: how stale is the most-recent chunk for this source?
latest = max(c.metadata["last_modified"] for c in new_chunks)
print("source-to-retrieval lag:", now() - latest)

# 4) Embedding model test: same model on both write and read paths?
assert embed_model_id == vector_store.collection.embed_model_id
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Test 1 confirms whether the new chunk is retrievable at all. If it is not in the top results for an obvious query, the chunking strategy or the embedding step failed. If it is in the top 50 but not top 5, you have a reranker / score fusion problem, not a retrieval problem.
  2. Test 2 confirms the metadata sidecar is intact. A common shape: the ingest job upserted text+vector but skipped the metadata payload. Without tenant_id, the multi-tenant filter pulls nothing for acme and the system falls back to a cross-tenant default — which still contains the old v1 doc.
  3. Test 3 quantifies the freshness lag. If "now - last_modified" exceeds the SLO, the CDC stream is stalled or the embed worker is backed up. The fix is operational (drain the queue), not algorithmic.
  4. Test 4 catches the silent killer: the team upgraded the embedding model on the write path but the retrieval service still embeds queries with the old one. Vectors are in different spaces; cosine similarity is meaningless. The fix is a blue/green collection swap, not a hot upgrade.

Output.

Failure mode Test signal Owner
Chunk missing / mis-split search returns no v2 source DE — chunker
Missing metadata no tenant_id on v2 chunks DE — ingest job
Stale index source-to-retrieval lag > SLO DE — CDC + worker
Embedding model mismatch embed model IDs differ ML + DE

Rule of thumb. Before touching the prompt template, run the four pipeline tests. If all four pass and the answer is still wrong, then the problem is in the model or the prompt — and you can hand it to the ML team with high confidence the data layer is clean.

Worked example — why "just use a bigger context window" does not fix bad chunks

Detailed explanation. A common temptation when a RAG system gives wrong answers is to widen the context window and stuff more chunks into every prompt. This reliably makes the bill explode without lifting accuracy, because the recall problem (the right chunk was not retrieved at all) is not solved by giving the model more wrong chunks. The fix is upstream: better chunking, better embeddings, hybrid + rerank — not a bigger prompt.

Question. Your support bot retrieves the top-3 chunks and returns wrong answers ~12% of the time. A colleague proposes lifting top-k from 3 to 30 (10x bigger context). Quantify why this is unlikely to fix the bug.

Input. Retrieval log breakdown over 1000 wrong-answer queries.

failure cause share of wrong answers
right chunk not in top-50 at all 62%
right chunk in top-50 but ranked outside top-3 25%
right chunk in top-3 but model still hallucinated 13%

Code.

# Failure mode breakdown — a tiny helper to bucket wrong answers
def diagnose(query, gold_chunk_id, top_n=50):
    candidates = vector_store.search(query, k=top_n)
    ranks = [i for i, c in enumerate(candidates) if c.id == gold_chunk_id]
    if not ranks:
        return "not_in_top_n"          # recall miss — bigger k will not help
    rank = ranks[0]
    if rank >= 3:
        return "rank_too_low"          # rerank fix
    return "model_failed"              # prompt / model fix

buckets = collections.Counter(diagnose(q.text, q.gold_id) for q in failed_queries)
print(buckets)
# Counter({'not_in_top_n': 620, 'rank_too_low': 250, 'model_failed': 130})
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The "lift top-k from 3 to 30" idea only helps the 25% of wrong answers where the right chunk was in top-50 but ranked below 3 — and only partially, because the model now also has 27 extra noisy chunks to confuse it.
  2. The dominant failure mode (62%) is recall miss — the right chunk is nowhere in the top-50. Increasing k from 3 to 30 changes none of those queries: the chunk is still missing. The fix is upstream — better chunking strategy, hybrid + BM25 score fusion, or re-embedding with a stronger model.
  3. The "model failed" bucket (13%) is the only one a prompt or model upgrade can fix, and it is the smallest bucket.
  4. Net: lifting top-k to 30 fixes at most 5-8% of the failures (the easier rerank misses), at 10x the LLM token cost. The right ROI is improving recall and adding a reranker.

Output.

Action Failures fixed Cost change
Lift top-k from 3 to 30 ~5-8% 10x prompt tokens
Add hybrid + BM25 fusion ~40-50% 2x ingest, ~0 query
Add cross-encoder reranker ~20% (on top of hybrid) +50ms latency
Rechunk with semantic splitter ~15% one-time reindex

Rule of thumb. Diagnose the failure bucket before spending money. "Lift top-k" is the single most-tempting and least-effective RAG fix — every dollar in extra prompt tokens is roughly five dollars not spent on the actual recall problem upstream.

Data engineering interview question on rag pipeline ownership

A senior interviewer often opens with: "Walk me through the four stages of a production RAG pipeline, who owns each stage, and the single most common failure mode in each. Where does a data engineer add the most value?" It blends pipeline architecture, retrieval, and freshness into one ownership map.

Solution Using the stage-ownership matrix

# A minimal pipeline-ownership map — runnable as a doc-test
STAGES = [
    {
        "stage": "ingest",
        "owner": "DE",
        "common_failure": "missing source connector or stale CDC stream",
        "telemetry": "source_lag_seconds",
    },
    {
        "stage": "chunk_embed",
        "owner": "DE (chunker) + ML (model choice)",
        "common_failure": "chunk size too small / boundary mid-sentence",
        "telemetry": "chunks_per_doc, embed_qps",
    },
    {
        "stage": "index",
        "owner": "DE",
        "common_failure": "missing metadata sidecar (tenant_id, ACL)",
        "telemetry": "metadata_coverage_pct",
    },
    {
        "stage": "retrieve_rerank",
        "owner": "DE (infra) + ML (reranker)",
        "common_failure": "embedding model mismatch between write and read",
        "telemetry": "recall_at_5, mrr_at_10",
    },
]

for s in STAGES:
    print(f"{s['stage']:>16}  owned by {s['owner']:<30}  watch: {s['telemetry']}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage Owner Common failure Telemetry
ingest DE stale CDC stream source_lag_seconds
chunk_embed DE + ML chunk size too small chunks_per_doc
index DE missing metadata metadata_coverage_pct
retrieve_rerank DE + ML embed model mismatch recall_at_5

The trace highlights that three of four stages are DE-owned outright, and the fourth is shared. The senior signal in this answer is naming the metadata sidecar coverage as the biggest preventable failure — most candidates focus on the embedding model and miss that the index schema is where the multi-tenant safety lives.

Output:

Stage DE value-add Single biggest win
ingest source connectors + CDC sub-minute source lag
chunk_embed strategy per content type semantic + hierarchical
index metadata schema + ACL 100% tenant_id coverage
retrieve_rerank hybrid + filter pushdown dense + BM25 fusion

Why this works — concept by concept:

  • Stage ownership matrix — naming the owner per stage frames RAG as a data product, not a model deployment. Interviewers reward this framing because it predicts how the candidate will run on-call rotations.
  • Telemetry per stage — the senior move is to attach an observable metric to each stage so the on-call rotation can localise failures in seconds. Source lag, metadata coverage, recall@5 are the canonical three.
  • Metadata sidecar is the safety story — every chunk carries tenant_id, source, ACL, last_modified — these are not optional, they are the only thing keeping retrieval augmented generation from cross-tenant leakage and stale answers.
  • Hybrid + rerank is the precision story — dense retrieval alone misses on rare keywords; BM25 alone misses on synonyms; fusion + rerank is the production default for hybrid search.
  • Cost — ingest is O(docs) one-time + O(changes) per CDC tick; embed is O(chunks) at write time; retrieve is O(log N) ANN + O(top-N) rerank per query. Reranker dominates query latency; embed dominates ingest latency.

DE
Topic — ETL
ETL pipeline problems (DE)

Practice →


2. End-to-end RAG pipeline architecture

The four-stage rag data pipeline — ingest, chunk + embed, index, retrieve + rerank — and the metadata sidecar that travels with every chunk

The mental model in one line: a production rag pipeline is two pipelines joined at a vector store — an offline batch+CDC pipeline that ingests, chunks, embeds, and upserts; and an online request pipeline that embeds the query, searches, reranks, and assembles the prompt. Once you draw those two flows on a whiteboard, every RAG architecture question collapses into "where does this responsibility sit?"

End-to-end RAG architecture diagram — top row is an offline ingest path (sources → normalize → chunk → embed → vector store), bottom row is the online retrieval path (query → embed → ANN search → rerank → LLM context), connected by the vector store in the middle; an observability strip on the right shows a small line-chart icon and an eval-set card, on a light PipeCode card.

The offline ingest pipeline.

  • Source connectors. One per source system. Confluence, Notion, Google Drive, S3, Postgres CDC, Slack export, GitHub repo, support ticket export. Each emits raw documents into a staging bucket with provenance metadata.
  • Normalize + strip. Strip HTML / Markdown formatting, remove boilerplate (navigation, footers, code-of-conduct banners), expand attachments, OCR images if needed, dedupe near-identical pages.
  • Split into logical units. Section-aware split: headings define sections, paragraphs define chunks within sections. Tables and code blocks are split as atomic units (do not chunk a table).
  • Chunker. Apply the per-content-type chunking strategy (covered in detail in section 3). Output is (doc_id, chunk_idx, text, metadata).
  • Embedder. Batch-embed chunks through the embedding model. Cache by content hash so unchanged chunks are not re-embedded on every run.
  • Vector store upsert. Upsert (vector, chunk_id, metadata) into the vector store. The chunk text goes to a sidecar store (Postgres, DynamoDB, S3) keyed by chunk_id.

The online retrieval pipeline.

  • Query embed. The same embedding model encodes the user query into a vector. Critical: write and read paths must use the same model.
  • ANN search with metadata filter. Approximate-nearest-neighbour search returns top-N candidates restricted by the metadata filter (tenant_id, source, ACL, recency window).
  • Score fusion (hybrid). Dense ANN scores are fused with a sparse BM25 score from a parallel inverted index. Weighted sum or Reciprocal Rank Fusion (RRF) are the two common shapes.
  • Reranker. A cross-encoder model takes the query + each of the top-N candidates and produces a precision-tuned score. Top-k after rerank are the chunks that go into the prompt.
  • Prompt assembly. Top-k chunks are concatenated with provenance headers ("Source: confluence/page-123, modified 2026-06-10") so the LLM can cite sources.

The metadata sidecar — what every chunk carries.

  • tenant_id — for multi-tenant SaaS, every retrieval is filtered by tenant_id = ?. No exceptions.
  • source — where the chunk came from (confluence/page-123, s3://bucket/key, postgres://schema.table#row). Powers attribution and trust signals.
  • document_id — group chunks back to their parent document for "show me the full doc" follow-ups.
  • last_modified — for freshness SLO measurement and time-window filters.
  • acl_ids — list of permission tags that the retrieval filter intersects with the user's permissions.
  • content_typeprose, code, table, transcript — used to pick the right reranker or trigger content-specific rules.
  • embed_model_id — the embedding model version that produced this vector. Critical for blue/green model swaps.

Observability — what to graph.

  • Source lag. Per source, the P95 source-to-retrieval lag. The single most important SLO for rag freshness.
  • Index hit rate. What fraction of queries return any chunks at all (low = empty store or over-filtered metadata).
  • Fallback ratio. Fraction of queries that fall through to the "no context" prompt path. A sharp uptick is the first signal of a stalled embed worker.
  • Recall@5 on golden set. Offline metric scored nightly against a curated Q&A set with known-correct chunk IDs.
  • Reranker latency P95. Cross-encoders dominate query latency; alert on regressions.

The eval harness.

  • Golden Q&A set. 200-2000 curated (question, expected_chunk_id, expected_answer) triples maintained by the product team plus SMEs.
  • Offline eval. Nightly run scores recall@k, MRR, faithfulness against the golden set. Drops trigger PagerDuty.
  • Online eval (LLM-as-judge). Sample of live queries scored by a stronger LLM against rubric. Drift detection.
  • Shadow traffic. New embedding model or chunker runs in shadow mode against live queries before the swap — compare top-5 overlap and recall before promoting.

Worked example — building the offline ingest stage in 50 lines

Detailed explanation. The minimal happy path: pull a document, normalize, chunk with overlap, embed in a batch, upsert with metadata. Everything in production is a hardening of these five steps — retries, dedup, content-hash caching, sidecar persistence.

Question. Sketch the offline ingest stage that takes a Confluence page, splits it into 500-token chunks with 75-token overlap, embeds each chunk, and upserts into a vector store with a tenant-aware metadata sidecar.

Input. A single Confluence page object.

field value
page_id "PAGE-42"
tenant_id "acme"
body_markdown "## Refund policy\n\nAt Acme..." (~3500 tokens)
last_modified "2026-06-10T09:14:00Z"

Code.

import hashlib
from typing import Iterable

CHUNK_TOKENS = 500
OVERLAP_TOKENS = 75


def chunk_text(text: str, size: int, overlap: int) -> Iterable[str]:
    """Sliding-window chunker — tokens approximated as whitespace splits."""
    tokens = text.split()
    step = size - overlap
    for i in range(0, len(tokens), step):
        yield " ".join(tokens[i : i + size])
        if i + size >= len(tokens):
            break


def ingest_page(page: dict) -> None:
    chunks = list(chunk_text(page["body_markdown"], CHUNK_TOKENS, OVERLAP_TOKENS))
    # Batch-embed all chunks in one call — 10-50x cheaper than per-chunk
    vectors = embed_model.encode(chunks)
    rows = []
    for idx, (text, vec) in enumerate(zip(chunks, vectors)):
        chunk_id = f"{page['page_id']}#chunk-{idx}"
        rows.append({
            "id": chunk_id,
            "vector": vec,
            "metadata": {
                "tenant_id": page["tenant_id"],
                "source": f"confluence/{page['page_id']}",
                "document_id": page["page_id"],
                "last_modified": page["last_modified"],
                "chunk_idx": idx,
                "content_hash": hashlib.sha256(text.encode()).hexdigest(),
                "embed_model_id": embed_model.id,
            },
        })
        text_store.put(chunk_id, text)            # sidecar — chunk text in Postgres / S3
    vector_store.upsert(rows)                     # vectors + metadata
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The chunker is a sliding-window splitter — 500 tokens forward, 75 tokens back, so each chunk shares ~15% with its neighbours. Overlap is what saves you when the answer straddles a chunk boundary.
  2. embed_model.encode(chunks) is called once with the full batch. Per-chunk calls are 10-50x more expensive due to per-request overhead and HTTP round-trips.
  3. Every chunk gets a deterministic chunk_id derived from the document id and chunk index, so re-ingesting the same page produces the same IDs — upsert overwrites instead of duplicating.
  4. The metadata sidecar carries every filter the retrieval path will ever need: tenant_id (multi-tenancy), source and document_id (attribution), last_modified (freshness), chunk_idx (sibling lookup), content_hash (dedup / skip-unchanged), and embed_model_id (blue/green safety).
  5. The chunk text goes into a sidecar text store keyed by chunk_id. The vector store only stores the vector — sidecar lookup happens at the prompt-assembly stage.
  6. Idempotency: re-ingesting the same page is a no-op for unchanged chunks (same hash, same vector, same metadata) and a one-row update for changed chunks.

Output.

chunk_id tenant_id text snippet vector dims
PAGE-42#chunk-0 acme "## Refund policy At Acme..." 1536
PAGE-42#chunk-1 acme "...within 14 days of purchase..." 1536
PAGE-42#chunk-2 acme "...exceptions for digital goods..." 1536

Rule of thumb. Every chunk needs three things from day one: a deterministic ID, a content hash, and a metadata sidecar that includes tenant_id, source, and last_modified. Bolt them on later and you are rewriting the ingest job, not patching it.

Worked example — the online retrieval stage end-to-end

Detailed explanation. Once the ingest stage has populated the index, the online path is a tight five-step pipeline: embed query, ANN search with metadata filter, BM25 fusion, rerank, assemble prompt. The whole loop should run in <300ms P95 to feel responsive.

Question. Sketch the online retrieval stage that takes a user query, applies the tenant filter, fuses dense + BM25 scores, reranks the top-50 with a cross-encoder, and returns the top-5 chunks plus assembled prompt.

Input. A single user query plus the calling user's tenant and ACL list.

field value
query "What is Acme's refund window for digital goods?"
tenant_id "acme"
acl_ids ["public", "support"]

Code.

def retrieve_and_rerank(query: str, tenant_id: str, acl_ids: list[str]) -> dict:
    # 1) Embed the query with the SAME model used at ingest
    qvec = embed_model.encode([query])[0]

    # 2) ANN search with metadata filter — tenant + ACL pushdown
    candidates = vector_store.search(
        vector=qvec,
        top_k=50,
        filter={
            "tenant_id": tenant_id,
            "acl_ids": {"$in": acl_ids},
        },
    )

    # 3) Hybrid score fusion — Reciprocal Rank Fusion (RRF) with k=60
    bm25_hits = bm25_index.search(query, top_k=50, filter={"tenant_id": tenant_id})
    fused = rrf_fuse(candidates, bm25_hits, k=60)

    # 4) Cross-encoder rerank on the top 50 fused candidates
    pairs = [(query, text_store.get(c.id)) for c in fused[:50]]
    rerank_scores = reranker.score(pairs)
    top5 = sorted(zip(fused[:50], rerank_scores), key=lambda x: -x[1])[:5]

    # 5) Assemble the prompt with provenance headers
    blocks = []
    for cand, _ in top5:
        m = cand.metadata
        text = text_store.get(cand.id)
        blocks.append(
            f"[source: {m['source']} · modified {m['last_modified']}]\n{text}"
        )
    return {
        "prompt_context": "\n\n---\n\n".join(blocks),
        "citations": [c.metadata["source"] for c, _ in top5],
    }


def rrf_fuse(dense, sparse, k=60):
    """Reciprocal Rank Fusion — combine two ranked lists by 1/(k+rank)."""
    scores: dict[str, float] = {}
    for rank, hit in enumerate(dense):
        scores[hit.id] = scores.get(hit.id, 0) + 1 / (k + rank + 1)
    for rank, hit in enumerate(sparse):
        scores[hit.id] = scores.get(hit.id, 0) + 1 / (k + rank + 1)
    return sorted(
        {c.id: c for c in dense + sparse}.values(),
        key=lambda c: -scores[c.id],
    )
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The query is embedded with the same model that produced the index vectors. If the model IDs disagree, vectors are in different spaces and cosine similarity is noise — this is the silent killer covered in section 1.
  2. The ANN search pushes the tenant filter down into the index — it is not a post-filter. Pinecone, Qdrant, Weaviate, and pgvector all support metadata predicate pushdown, and using it is the difference between a 10ms query and a 500ms query at scale.
  3. The BM25 search runs in parallel against a sparse inverted index (Elasticsearch, OpenSearch, Lucene) with the same tenant filter. Dense + sparse together is the hybrid search shape.
  4. RRF fuses the two ranked lists by summing 1/(k+rank) — no need to normalize scores, no per-system weight tuning. k=60 is the standard starting value from the IR literature.
  5. The cross-encoder reranker takes (query, chunk_text) pairs and produces a precision-tuned score. It is 50-100x slower per pair than ANN, so it only runs on the top 50 — never the full corpus.
  6. Top-5 reranked chunks become the prompt context. Each block carries a source and last_modified header so the LLM can quote the source and the consumer can audit freshness.

Output.

step result
ANN top-50 (dense) 50 candidates
BM25 top-50 (sparse) 50 candidates
RRF fused ~70 unique candidates
Reranked top-5 5 chunks for prompt
Prompt context ~2500 tokens with citations

Rule of thumb. The online stage is a fixed five-step pipeline: embed → ANN+filter → BM25+fusion → rerank → assemble. Every step has a 100ms budget; every step has its own telemetry. If P95 latency drifts, the slow step is almost always the reranker — cap the rerank candidate set and tune k.

Worked example — the eval harness golden-set scoring loop

Detailed explanation. A golden Q&A set is the only reliable way to know whether a pipeline change improved or hurt retrieval. The shape: a CSV / Parquet of (question, expected_chunk_id, expected_answer) triples maintained by SMEs. The harness runs every nightly job and scores recall@k and MRR. A sustained drop pages the on-call.

Question. Sketch a nightly offline eval that scores the current pipeline against a golden set and reports recall@5 and mean reciprocal rank (MRR) at 10.

Input. A 500-row golden set.

question_id question expected_chunk_id
g001 "Refund window for digital goods?" PAGE-42#chunk-1
g002 "How long is the trial period?" PAGE-09#chunk-3

Code.

def score_pipeline(golden: list[dict], k: int = 5) -> dict:
    recall_hits = 0
    rr_sum = 0.0
    for q in golden:
        hits = retrieve_and_rerank(
            query=q["question"],
            tenant_id=q.get("tenant_id", "default"),
            acl_ids=q.get("acl_ids", ["public"]),
        )
        ids = [h.id for h in hits["chunks"][:10]]
        if q["expected_chunk_id"] in ids[:k]:
            recall_hits += 1
        try:
            rank = ids.index(q["expected_chunk_id"]) + 1
            rr_sum += 1 / rank
        except ValueError:
            pass  # not in top-10 → reciprocal rank contributes 0
    return {
        "recall_at_k": recall_hits / len(golden),
        "mrr_at_10": rr_sum / len(golden),
        "n": len(golden),
    }


# Nightly job
result = score_pipeline(load_golden_set(), k=5)
publish_metric("rag.recall_at_5", result["recall_at_k"])
publish_metric("rag.mrr_at_10", result["mrr_at_10"])
if result["recall_at_k"] < 0.85:
    page_oncall("recall@5 regression", result)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For each golden question, the harness runs the real online retrieval pipeline — same embed, same filter, same rerank — against the live index. No mocking.
  2. recall_at_k is "did the gold chunk appear in the top-k?" averaged across the golden set. The standard production threshold is 0.85-0.95 depending on stakes.
  3. MRR at 10 is "the average of 1/rank over the golden set, with rank=∞ contributing 0." MRR rewards getting the right chunk high in the ranking — it is a precision-tilted recall metric.
  4. Both metrics are published to the metric store and trended over time. A sustained drop ≥5 points triggers PagerDuty and a rollback investigation.
  5. The harness is the single source of truth on whether a chunking change, an embedding model upgrade, or a reranker swap helped. Without it, every change is a vibe-based decision.

Output.

metric value threshold
recall@5 0.91 ≥ 0.85
MRR@10 0.74 ≥ 0.65
n 500

Rule of thumb. No golden set, no production RAG. Build a 200-row golden set on day one; grow it to 1000-2000 as edge cases emerge. The harness is what turns RAG from "vibes shipping" into a real data product with an SLO.

Data engineering interview question on online vs offline pipelines

A senior interviewer might frame this as: "Draw the offline ingest and online retrieval pipelines on a whiteboard, then mark where each one can fail at 3am and how you would alert on it." It probes both system design and on-call instincts.

Solution Using a two-pipeline diagram + failure-mode catalogue

# Two pipelines joined at the vector store
OFFLINE = [
    "source_connector",
    "normalize",
    "chunk",
    "embed",
    "vector_upsert",
]
ONLINE = [
    "query_embed",
    "ann_search_with_filter",
    "bm25_search",
    "rrf_fuse",
    "rerank",
    "assemble_prompt",
]

FAILURES = {
    "source_connector": "auth expired / source down → source_lag_seconds spike",
    "chunk": "boundary mid-sentence → recall@5 drop on golden set",
    "embed": "model API down → embed_queue_depth grows unbounded",
    "vector_upsert": "schema mismatch → upsert_error_rate spike",
    "ann_search_with_filter": "metadata not indexed → P95 latency spike",
    "rerank": "model timeout → fallback_ratio spike",
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Pipeline Stage Failure signal Alert
offline source_connector source_lag_seconds > 600 page
offline embed embed_queue_depth growing page
offline vector_upsert upsert_error_rate > 1% page
online ann_search P95 latency > 200ms page
online rerank fallback_ratio > 5% page
online assemble_prompt empty_context_ratio > 2% warn

The trace highlights that offline and online have orthogonal failure modes — an offline stall produces a staleness failure; an online stall produces a latency or quality failure. Each needs its own SLO.

Output:

Pipeline Top SLO Alert threshold
offline ingest P95 source-to-retrieval lag < 5 min
online retrieve P95 end-to-end latency < 300 ms
online quality recall@5 (nightly) ≥ 0.85

Why this works — concept by concept:

  • Two pipelines, one vector store — the offline path is throughput-bound (catch up on backlog); the online path is latency-bound (respond in <300ms). Splitting them lets you scale the embed worker independently from the retrieval service.
  • SLO per pipeline — the offline SLO is a freshness number (source_lag_seconds); the online SLO is a latency number (P95_response_time); the quality SLO is an offline metric (recall@5). All three matter.
  • Filter pushdown — the metadata filter (tenant_id, acl_ids) is pushed into the ANN search, not applied after. This is the single biggest performance lever in the online path.
  • Reranker is the precision lever — without it, you get hybrid-search recall but consumer-noisy ranking. With it, top-5 precision climbs sharply at the cost of ~50-100ms.
  • Cost — offline: O(docs) one-time + O(changes) per CDC tick + O(chunks) embedding cost. Online: one embed call + one ANN query + one BM25 query + one rerank batch per request. Reranker dominates online cost; embedding dominates offline cost.

DE
Topic — streaming
Streaming pipeline problems (DE)

Practice →


3. Chunking strategies — fixed, semantic, recursive, hierarchical

chunking strategies decide what the retriever can ever return — pick the strategy per content type, not per project

The mental model in one line: a chunk is the smallest unit your retriever can return, so the chunk shape is the upper bound on retrieval precision — split mid-sentence and the model gets half an idea; split mid-paragraph and you lose the connective tissue between claims. Once you say "the chunk is the retrieval unit," you stop optimising token counts and start optimising meaningful boundaries.

Four-strategy chunking comparison — top-left fixed-window strategy shown as a long document sliced into equal rectangles, top-right recursive splitter shown as a tree splitting paragraph→sentence→token, bottom-left semantic chunking shown as a similarity curve with split markers at drops, bottom-right hierarchical parent-child shown as small child-chunk tiles linked to a larger parent card; an overlap-window pill sits at the centre, on a light PipeCode card.

The four families.

  • Fixed token windows. Pick a target size (e.g. 500 tokens) and a stride. Cheap, deterministic, dumb at boundaries. Default starting point — beat it before adding complexity.
  • Recursive character / token splitter. Try to split by paragraph; if too big, fall back to sentence; if still too big, fall back to a fixed token window. LangChain's RecursiveCharacterTextSplitter popularised this shape.
  • Semantic chunking. Embed each sentence and split where the cosine similarity between adjacent sentences drops below a threshold — i.e. where the topic visibly shifts. Higher quality, ~2-3x more expensive at ingest.
  • Hierarchical / parent-child. Index small child chunks (for high-recall retrieval) but at prompt time return the parent chunk (a paragraph or section) that contains the matched child. The model gets context; the retriever gets precision.

Overlap windows.

  • What overlap is. Each chunk shares its first N tokens with the previous chunk's last N tokens. 10-20% is the standard range (~50-100 tokens for a 500-token chunk).
  • Why it matters. A claim that straddles a chunk boundary appears in both chunks instead of being orphaned. Overlap is the cheapest insurance against boundary loss.
  • When to skip it. Atomic content (a code block, a table, a paragraph) does not need overlap — it is already its own unit.

Per-content-type strategy table.

Content type Recommended strategy Notes
Prose / docs recursive or semantic, 400-600 tokens, 15% overlap semantic if budget allows
Code one chunk per function / class (AST split) never split mid-function
Tables / structured one chunk per table or per row group preserve column headers
Transcripts / chat one chunk per turn or per N seconds speaker label as metadata
FAQs one chunk per Q+A pair the Q is the retrieval signal
Long PDFs (manuals) hierarchical: child = paragraph, parent = section retrieve child, serve parent

Common chunking interview probes.

  • "What is the right chunk size?" — there is no universal right; start at 500 tokens with 75 overlap, then tune against the golden set. Senior signal: name "embedding model context window" as the hard upper bound (most embedding models are 512 tokens) and "downstream LLM context budget" as the soft constraint.
  • "When do you use semantic chunking?" — when prose has shifting topics within a single document (long-form blogs, transcripts, multi-topic reports). Skip it for tightly-scoped reference docs where fixed windows match natural paragraph length.
  • "What is parent-child chunking?" — index small chunks for high-recall ANN, but at prompt time return the parent chunk (a paragraph, section, or sliding window) so the LLM gets enough context. Standard shape on long-form RAG.
  • "How do you chunk code?" — never mid-function. Use the language's AST to split at function and class boundaries; index the function signature + docstring + body as one chunk.

Worked example — fixed-window vs recursive splitter on the same document

Detailed explanation. A fixed-window splitter is the simplest possible chunker — slide a window of N tokens with stride S. It is fast, but it cheerfully cuts sentences and paragraphs in half. A recursive splitter tries semantic boundaries first (paragraphs, then sentences) before falling back to a fixed window, so most splits land at natural boundaries.

Question. Implement a fixed-window splitter and a recursive splitter for a 1200-token document. Compare the splits and show why the recursive version preserves more semantic units.

Input. A document with three paragraphs (400 + 350 + 450 tokens, separated by \n\n).

paragraph tokens
P1 — refund policy 400
P2 — exceptions 350
P3 — appeals 450

Code.

def fixed_window(text: str, size: int = 500, overlap: int = 75) -> list[str]:
    tokens = text.split()
    out, step = [], size - overlap
    for i in range(0, len(tokens), step):
        chunk = tokens[i : i + size]
        if not chunk:
            break
        out.append(" ".join(chunk))
        if i + size >= len(tokens):
            break
    return out


def recursive_split(text: str, size: int = 500, overlap: int = 75) -> list[str]:
    """Try paragraph → sentence → fixed-window fallback."""
    # 1) Try paragraph split first
    paras = text.split("\n\n")
    out: list[str] = []
    for p in paras:
        ptokens = p.split()
        if len(ptokens) <= size:
            out.append(p)
            continue
        # 2) Paragraph too big → fall back to sentence split
        sents = re.split(r"(?<=[.!?])\s+", p)
        buf: list[str] = []
        buf_tok = 0
        for s in sents:
            st = s.split()
            if buf_tok + len(st) > size and buf:
                out.append(" ".join(buf))
                buf, buf_tok = [], 0
            buf.append(s)
            buf_tok += len(st)
        if buf:
            out.append(" ".join(buf))
        # 3) Sentence-level chunks still too big? fall back to fixed window
        out = [c if len(c.split()) <= size else fixed_window(c, size, overlap)
               for c in out]
    # Flatten any nested lists from step 3
    flat: list[str] = []
    for c in out:
        flat.extend(c) if isinstance(c, list) else flat.append(c)
    return flat
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Fixed-window on the 1200-token document with size=500, overlap=75 produces three chunks: tokens 0-500, 425-925, 850-1200. The first chunk ends mid-paragraph 2 (token 500 is inside P2). The second chunk starts mid-paragraph 2. Boundary loss.
  2. Recursive split tries paragraph boundaries first. P1 (400 tokens) is ≤500, so it becomes one chunk. P2 (350 tokens) is ≤500, so it becomes one chunk. P3 (450 tokens) is ≤500, so it becomes one chunk. Three clean paragraph-aligned chunks, zero mid-sentence splits.
  3. If a paragraph exceeds 500 tokens, the recursive splitter falls back to sentence split — still semantic, just one level finer. Only paragraphs and sentences that exceed the size fall back to the dumb fixed-window splitter.
  4. The recursive version produces more chunks on average (one per paragraph instead of one per 500-token window), but every chunk respects a natural boundary. Retrieval precision improves at the cost of slightly more chunks to index.

Output.

Strategy # chunks mid-sentence splits
fixed_window(500, 75) 3 2 (inside P2)
recursive_split(500, 75) 3 0

Rule of thumb. Default to recursive splitting for prose — the cost over a fixed window is negligible (one extra pass over the text) and the recall lift is consistent. Reserve plain fixed-window for tightly-controlled content types where the natural unit is already the window size (timeseries, log lines, code lines).

Worked example — semantic chunking by similarity drop

Detailed explanation. Semantic chunking embeds each sentence and walks the document looking for drops in cosine similarity between adjacent sentences — those drops mark topic shifts. A chunk is the run of sentences between two drops. The cost is one extra embedding call per sentence at ingest, but the chunks line up with topical boundaries that no syntactic splitter can find.

Question. Implement a semantic chunker that splits a document at sentence boundaries where the cosine similarity between consecutive sentences drops below a threshold.

Input. A 6-sentence document that shifts topic twice.

sentence_idx text
0 "Refund policy: we accept returns within 14 days."
1 "Returns must be in original packaging."
2 "Shipping is free on orders over $50."
3 "We use FedEx and UPS for delivery."
4 "Our support hours are 9 to 5 EST."
5 "Reach us via email or chat."

Code.

def semantic_chunk(text: str, threshold: float = 0.55) -> list[str]:
    sents = re.split(r"(?<=[.!?])\s+", text)
    if len(sents) < 2:
        return sents

    vecs = embed_model.encode(sents)
    boundaries = [0]  # split *before* these indices

    for i in range(1, len(vecs)):
        sim = cosine(vecs[i - 1], vecs[i])
        if sim < threshold:
            boundaries.append(i)
    boundaries.append(len(sents))

    chunks: list[str] = []
    for a, b in zip(boundaries[:-1], boundaries[1:]):
        chunks.append(" ".join(sents[a:b]))
    return chunks


def cosine(a, b):
    import math
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each sentence is embedded once at ingest time — this is the cost driver. For a 100-sentence document, semantic chunking issues 100 embed calls instead of zero (fixed-window).
  2. Adjacent sentences are compared pairwise. The cosine similarity between sentence 1 ("packaging") and sentence 2 ("shipping") is below 0.55 → boundary inserted before sentence 2.
  3. The walk continues: sentence 3 vs 4 similar (both about shipping), no split. Sentence 4 vs 5 below 0.55 (shipping → support hours), boundary inserted.
  4. Three chunks emerge: [0,1] refund/returns, [2,3] shipping, [4,5] support. Each chunk is a topical unit, even though they all sit in one source document.
  5. Threshold is tuned on the golden set. Too high → too many chunks (every sentence is its own chunk); too low → no splits at all. 0.45-0.65 is the typical range with OpenAI / Cohere embeddings.

Output.

chunk_idx sentences topic
0 0, 1 refund / returns
1 2, 3 shipping
2 4, 5 support

Rule of thumb. Use semantic chunking when documents mix topics (long-form blogs, all-hands transcripts, multi-section policy docs). Skip it when each source document is already a single tight topic — the marginal recall gain does not pay for the ingest cost.

Worked example — hierarchical (parent-child) chunking

Detailed explanation. Index small chunks (sentences or short paragraphs) for high-recall ANN matching, but at prompt assembly time return the parent chunk (a section or full paragraph) that contains the matched child. The retriever gets to match on tight semantic units; the LLM gets enough surrounding context to answer.

Question. Implement a parent-child chunker that indexes sentence-level child chunks but returns the paragraph-level parent at retrieval time.

Input. A 3-paragraph document where each paragraph has 2-3 sentences.

paragraph_idx sentence_idx text
0 0 "Refunds are accepted within 14 days."
0 1 "Returns must be in original packaging."
1 0 "Digital goods are non-refundable."
1 1 "Subscriptions cancel at period end."
2 0 "Contact support at help@acme.com."

Code.

def parent_child_chunk(doc_id: str, text: str) -> tuple[list[dict], dict]:
    parents: dict[str, str] = {}
    children: list[dict] = []

    paras = text.split("\n\n")
    for p_idx, para in enumerate(paras):
        parent_id = f"{doc_id}#para-{p_idx}"
        parents[parent_id] = para

        sents = re.split(r"(?<=[.!?])\s+", para)
        for s_idx, sent in enumerate(sents):
            children.append({
                "id": f"{doc_id}#sent-{p_idx}-{s_idx}",
                "text": sent,
                "parent_id": parent_id,
            })
    return children, parents


def retrieve_with_parent(query: str, top_k: int = 5) -> list[str]:
    # 1) ANN search over the child (sentence) index
    child_hits = vector_store.search(query=query, top_k=top_k * 3)

    # 2) Resolve to parent chunks, dedupe — same parent counted once
    seen, out = set(), []
    for c in child_hits:
        pid = c.metadata["parent_id"]
        if pid in seen:
            continue
        seen.add(pid)
        out.append(parent_store.get(pid))
        if len(out) == top_k:
            break
    return out
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. At ingest, every sentence becomes a child chunk with a vector, and every paragraph becomes a parent record (text only — no vector needed because parents are never directly searched).
  2. Each child carries a parent_id in its metadata sidecar so the retrieval path can resolve from match back to context.
  3. At query time, the ANN search runs over the child index — short, semantically tight units that match queries crisply.
  4. The result list of child matches is collapsed by parent_id (dedupe) and the parent paragraph text is returned. If sentences 0 and 1 of paragraph 0 both match, the paragraph is returned once.
  5. top_k * 3 over-retrieves on the child side because multiple children from the same parent may match — over-fetch lets you still emit top_k unique parents.

Output (for query "How long for refunds?").

match child_id parent returned
1st hit doc#sent-0-0 "Refunds are accepted within 14 days. Returns must be in original packaging."

Rule of thumb. Use parent-child when chunks need to be small enough for tight retrieval (sentence-level) but the LLM needs surrounding context to answer (paragraph or section level). It is the default shape for technical docs, legal docs, and long-form policy.

Worked example — overlap window math and why 15% is the default

Detailed explanation. Overlap is the cheapest insurance against a fact landing across a chunk boundary. The math is simple: with overlap O on a chunk of size S, every claim within the first O tokens of a chunk also appears in the previous chunk; every claim within the last O tokens also appears in the next chunk. A claim has two shots at being returned.

Question. Given chunk size 500 tokens and overlap 75 tokens, what fraction of the document appears in two chunks? When is overlap worth it and when is it waste?

Input. A document of 5000 tokens.

Code.

def overlap_stats(doc_tokens: int, size: int, overlap: int) -> dict:
    step = size - overlap
    n_chunks = (doc_tokens - overlap) // step + (1 if (doc_tokens - overlap) % step else 0)
    total_tokens_stored = n_chunks * size
    redundant = total_tokens_stored - doc_tokens
    return {
        "n_chunks": n_chunks,
        "total_tokens_stored": total_tokens_stored,
        "redundant_tokens": redundant,
        "overlap_pct": overlap / size,
        "storage_overhead_pct": redundant / doc_tokens,
    }


for overlap in (0, 50, 75, 100, 200):
    print(overlap, overlap_stats(5000, 500, overlap))
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step = size - overlap = 500 - 75 = 425. To cover 5000 tokens, you need ceil((5000 - 75) / 425) ≈ 12 chunks (vs 10 chunks with zero overlap).
  2. Total tokens stored = 12 × 500 = 6000. Doc has 5000 unique tokens. Redundant tokens = 1000, which is exactly 2 × (chunks - 1) × overlap = 2 × 11 × ... approximated by (n_chunks - 1) * overlap overhead.
  3. Storage overhead at 15% overlap is ~20% extra tokens stored (and embedded, and indexed). That is the cost.
  4. The benefit: every fact within 75 tokens of a chunk boundary now appears in two chunks, doubling its chance of being retrieved.
  5. 15% (~75 tokens on a 500-token chunk) is the empirical sweet spot — below 10% leaves obvious boundary gaps; above 25% pays cost without much extra recall.

Output.

overlap n_chunks redundant tokens storage overhead
0 10 0 0%
50 11 500 10%
75 12 1000 20%
100 13 1500 30%
200 17 3500 70%

Rule of thumb. Start at 15% overlap (75 tokens on a 500-token chunk) — it is the empirically calibrated default. Drop to 0 only for atomic content (a row, a function, a Q+A pair) where there is no boundary to worry about. Past 25% you pay storage and ingest cost without commensurate recall gain.

Data engineering interview question on picking a chunking strategy

A senior interviewer might frame this as: "You inherit a RAG system with 100k mixed documents — long-form policy PDFs, support tickets, code samples, transcripts. The current chunker is a fixed 1000-token window with no overlap. Recall@5 is 0.62 on the golden set. Where do you start?"

Solution Using a per-content-type chunking strategy

def chunk_dispatch(doc: dict) -> list[dict]:
    """Pick the right chunker per content type — strategy is not one-size-fits-all."""
    content_type = doc["content_type"]
    if content_type == "policy_pdf":
        # Long-form, multi-topic → hierarchical (sentence child, paragraph parent)
        return parent_child_chunker(doc, child_size=120, parent_size=600, overlap=15)
    if content_type == "support_ticket":
        # Short, conversational → one chunk per ticket, no overlap
        return [{"text": doc["body"], "metadata": doc["metadata"]}]
    if content_type == "code":
        # AST-aware split at function / class boundaries
        return ast_chunker(doc, language=doc["language"])
    if content_type == "transcript":
        # Topic-shift aware → semantic chunking
        return semantic_chunker(doc, threshold=0.55)
    # Default for plain prose
    return recursive_chunker(doc, size=500, overlap=75)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

content_type strategy starting params expected recall@5 lift
policy_pdf hierarchical child 120 / parent 600 +0.15-0.20
support_ticket atomic one chunk per ticket +0.05-0.10
code AST split per function / class +0.10-0.15
transcript semantic threshold 0.55 +0.10-0.15
plain prose recursive 500 / 75 overlap +0.05 baseline

The trace highlights that a single strategy across all content types is the bug, not a missing feature. Switching from fixed-1000 to per-type strategies typically lifts recall@5 by 0.15-0.25 in aggregate — the single largest one-shot win in any RAG project.

Output:

Metric Before (fixed 1000) After (per-type)
recall@5 0.62 ~0.83
MRR@10 0.41 ~0.63
avg chunks per doc 5.2 11.7

Why this works — concept by concept:

  • Per-content-type dispatch — the chunk shape that maximises recall depends entirely on the content shape. Code wants AST boundaries; policy wants section boundaries; tickets are already atomic; transcripts shift topic. One chunker for all four is a contradiction.
  • Hierarchical for long-form — small children give tight retrieval; large parents give the LLM enough context. The standard winning shape for technical docs.
  • Semantic for shifting topics — costs more at ingest but pays back on multi-topic documents (transcripts, blogs). Skip on tight-scope docs.
  • AST for code — splitting mid-function loses the function signature or the body. AST-aware chunking pairs them — interviewers love this answer because it shows content-aware thinking.
  • Cost — fixed-window is O(doc_tokens); recursive adds ~1 pass; semantic adds one embed call per sentence; hierarchical doubles the storage (children + parents). The recall lift typically pays back within 1 week of token cost.

DE
Topic — data transformation
Data transformation problems (DE)

Practice →


4. Embeddings + storage — choosing models and shaping the index

embeddings decide what "similar" means in your index — and the metadata sidecar decides whether the right user is allowed to see it

The mental model in one line: an embedding model is a learned similarity function — it determines which chunks the retriever calls "close" — and a vector stores schema is the index plus the metadata sidecar that keeps that similarity safe to ship. Once you say "the embedding decides recall, the metadata decides safety," every RAG storage question fits into one of those two columns.

Embedding and storage diagram — left zone shows three labelled embedding model cards (OpenAI, Cohere, OSS), middle zone shows a transformation arrow into vectors and a metadata sidecar card with tenant_id / source / ACL chips, right zone shows a vector store hexagon with a small fusion ribbon labelled 'dense + BM25' feeding a reranker card above, on a light PipeCode card.

Embedding model selection — the four axes.

  • Quality (recall@k on MTEB). OpenAI text-embedding-3-large, Cohere embed-v3, and OSS models like BGE-large and e5-mistral lead the public benchmarks. The gap between top-tier hosted and top OSS narrowed sharply in 2025-2026.
  • Dimensions. 256-dim models are 6x cheaper to store and search than 1536-dim models, with single-digit-point recall trade-off. Most production teams ship 384-768 dim now.
  • Cost. Per-million-token embedding cost. Hosted models charge ~$0.02-0.13 per million tokens; self-hosted OSS is GPU-amortised.
  • Latency. Per-batch latency at ingest. Critical for the freshness path — a stalled embedder is the most common cause of rag freshness SLO misses.

The "same model on read and write" rule.

  • The vectors in the index were produced by model M; a query must be embedded by model M to be comparable. Different models live in different vector spaces — cosine similarity between them is noise.
  • The fix when you want to upgrade: do a full re-embed into a new collection (covered in section 5 under blue/green).
  • The pre-flight check: store embed_model_id in the metadata sidecar of every chunk, and assert it matches the query embedder at retrieval time. Cheap insurance against a silent bug.

Batch vs streaming embedding.

  • Batch. A nightly job re-embeds all changed chunks. Fine for low-freshness use cases (knowledge base of reference docs). Cheaper per-token.
  • Streaming. CDC → Kafka → embed worker → vector upsert. Sub-minute lag. Required for high-freshness use cases (ticket triage, live policy lookup).
  • Hybrid. Batch for bulk re-embeds; streaming for incremental change. Most production systems converge to this shape.

Metadata schema — the columns every chunk carries.

  • tenant_id — multi-tenant filter pushdown. Required.
  • source — origin URI for attribution and trust signals.
  • document_id — group children back to parent doc.
  • last_modified — for freshness SLO and recency filters.
  • acl_ids — list of permission tags intersected with the user's permissions.
  • content_typeprose, code, table, transcript — drives content-specific behaviour.
  • embed_model_id — embedding model version. The blue/green safety pin.
  • content_hash — SHA-256 of chunk text. Powers skip-unchanged and dedupe.

Hybrid retrieval — dense + BM25.

  • Dense alone. Great on semantic synonyms ("revenue" matches "income"). Misses on rare keywords ("RT-2718 error code") because rare tokens have weak embeddings.
  • BM25 alone. Great on rare keywords. Misses on synonyms. Sensitive to vocabulary mismatch.
  • Hybrid (RRF or weighted). Recovers both regimes. The 2026 default for production RAG. The typical weighted formula: 0.6 * dense_score + 0.4 * bm25_score after both are min-max normalised; the typical rank-based formula is Reciprocal Rank Fusion with k=60.
  • When weighted vs RRF. RRF when scores from the two systems are not directly comparable (different score scales); weighted when you have tuned weights from a held-out set.

Reranking — the precision multiplier.

  • Bi-encoder (ANN). The embedding model is a bi-encoder: query and doc are embedded separately and compared by cosine. Fast (millions of vectors per second) but loses precision because the comparison is just a dot product.
  • Cross-encoder (reranker). A second model that takes (query, doc_text) as a pair and outputs a relevance score. Far more expressive (full attention across query + doc) but slow.
  • The standard shape. ANN top-N → reranker → top-k. N is typically 50; k is typically 5. The reranker only fires on 50 candidates per query, so the wall-clock cost is bounded.
  • Models. Cohere rerank-v3, BGE-reranker, mxbai-rerank. Hosted models add ~50-100ms; self-hosted are 30-200ms depending on GPU and batch size.

Worked example — picking embedding dimensions: 256 vs 768 vs 1536

Detailed explanation. Higher-dimension embeddings can express more nuance but pay for it in storage (4 bytes per dim × N chunks), index size, and query latency. Most teams over-pay for dimensions; a calibrated dim choice is one of the highest-leverage decisions in a RAG project.

Question. A team has 10M chunks. Compare storage and query cost across 256, 768, and 1536 dimensions. Show the ANN recall trade-off.

Input. 10M chunks, all float32 vectors.

Code.

def storage_estimate(n_chunks: int, dims: int) -> dict:
    bytes_per_chunk = dims * 4   # float32
    raw_bytes = n_chunks * bytes_per_chunk
    return {
        "dims": dims,
        "bytes_per_chunk": bytes_per_chunk,
        "raw_storage_gb": raw_bytes / 1e9,
        "approx_search_relative_cost": dims,
    }


for dims in (256, 384, 768, 1024, 1536):
    print(storage_estimate(10_000_000, dims))
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. 10M chunks × 1536 dims × 4 bytes = 61.4 GB raw vectors. At 768 dims → 30.7 GB. At 256 dims → 10.2 GB. Six-fold storage difference between the extremes.
  2. ANN search cost is roughly proportional to dims for the distance computation (HNSW edge comparisons scale linearly). A 1536-dim query is ~6x more compute per node visited than a 256-dim query.
  3. Recall trade-off (from MTEB benchmarks): going from 1536 → 768 typically costs 0-3 points of recall@10. Going to 384 costs 3-6 points. Going to 256 costs 5-10 points but pays back 6x on storage and search.
  4. Matryoshka embeddings (variable-dim truncation, where lower-dim prefixes are themselves usable embeddings) let you store at 1536 and search at 384 — a popular 2026 pattern.

Output.

dims raw storage (10M) rel. search cost typical recall lost
256 10.2 GB 1.0× 5-10 pts
384 15.4 GB 1.5× 3-6 pts
768 30.7 GB 3.0× 0-3 pts
1024 40.9 GB 4.0× 0-2 pts
1536 61.4 GB 6.0× reference

Rule of thumb. Default to 384-768 dims for production RAG; reserve 1536 for cases where the golden-set recall difference justifies the 6x storage and search cost. Matryoshka embeddings (store 1536, search 384) are the best of both worlds when the embedding model supports them.

Worked example — hybrid retrieval with Reciprocal Rank Fusion

Detailed explanation. Dense and sparse retrieval miss on orthogonal vocabulary regimes. Reciprocal Rank Fusion combines them by rank rather than score — no need to normalise scales, no per-system weight tuning. The 2026 default for hybrid search.

Question. Implement RRF that fuses a dense ANN result list and a BM25 result list into a single ranked output. Compare with a naive weighted-score fusion on a small example.

Input. Dense and sparse top-5 for query "refund window digital".

rank dense_id dense_score sparse_id sparse_score
1 A 0.92 C 18.4
2 B 0.88 A 17.1
3 C 0.71 D 14.0
4 D 0.65 B 12.5
5 E 0.62 F 9.8

Code.

def rrf(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(dense):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=lambda d: -scores[d])


def weighted(dense, sparse, w_dense=0.6, w_sparse=0.4):
    # Both scores must be min-max normalised first
    def norm(items, key):
        vals = [v for _, v in items]
        lo, hi = min(vals), max(vals)
        return {k_: (v - lo) / (hi - lo + 1e-9) for k_, v in items}

    d = norm(dense, "dense")
    s = norm(sparse, "sparse")
    ids = set(d) | set(s)
    return sorted(ids, key=lambda i: -(w_dense * d.get(i, 0) + w_sparse * s.get(i, 0)))


dense_ids = ["A", "B", "C", "D", "E"]
sparse_ids = ["C", "A", "D", "B", "F"]
print("RRF:", rrf(dense_ids, sparse_ids, k=60))
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. RRF assigns each candidate a score of 1/(k + rank) for each list it appears in. Lower ranks get more weight. The score for a doc that appears in both lists is the sum of the two contributions.
  2. For doc A: rank 1 in dense (score 1/61 = 0.0164), rank 2 in sparse (score 1/62 = 0.0161). Total = 0.0325.
  3. For doc C: rank 3 in dense (1/63 = 0.0159), rank 1 in sparse (1/61 = 0.0164). Total = 0.0323.
  4. A wins narrowly over C because it ranked higher in dense; both win over docs that appear in only one list because they get only one contribution.
  5. RRF needs no score normalisation. Weighted score fusion requires normalising the dense cosine similarities and the BM25 scores to a common range and tuning the weights on a held-out set. RRF is robust to both.
  6. k=60 is the value from the original RRF paper. Smaller k weights the top of each list more aggressively; larger k flattens the contribution. The default works well in practice.

Output (RRF top-5).

rank doc rrf_score
1 A 0.0325
2 C 0.0323
3 B 0.0318
4 D 0.0314
5 E 0.0164

Rule of thumb. Default to RRF for hybrid retrieval — it is robust, requires no tuning, and works whenever you can produce two ranked lists. Switch to weighted score fusion only if you have a held-out tuning set and the RRF result is leaving recall on the table.

Worked example — metadata pushdown vs post-filter

Detailed explanation. Multi-tenant RAG must restrict every retrieval to the calling tenant. Where the filter is applied matters enormously — pushdown into the ANN index is cheap; post-filter after ANN is catastrophic at scale.

Question. Show two implementations of a tenant-aware retrieve: one with metadata pushdown into the vector store, one with a post-filter on the application side. Quantify the cost difference.

Input. A 50M-chunk index where each tenant has ~50k chunks (1000 tenants).

Code.

# BROKEN — post-filter on the application side
def retrieve_postfilter(query: str, tenant_id: str, k: int = 5) -> list:
    qvec = embed_model.encode([query])[0]
    candidates = vector_store.search(vector=qvec, top_k=2000)  # huge over-fetch
    return [c for c in candidates if c.metadata["tenant_id"] == tenant_id][:k]

# CORRECT — push the filter down into the ANN search
def retrieve_pushdown(query: str, tenant_id: str, k: int = 5) -> list:
    qvec = embed_model.encode([query])[0]
    return vector_store.search(
        vector=qvec,
        top_k=k,
        filter={"tenant_id": tenant_id},
    )
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. With 50M chunks and 1000 tenants, a tenant's chunks are 0.1% of the index. To get a top-5 after filtering, the post-filter approach must over-fetch by a huge factor — empirically 2000-10000 candidates to reliably surface 5 from the target tenant.
  2. ANN search cost is O(log N) per query, but at large over-fetch the constants matter — fetching 10000 vs 5 candidates is roughly 100-1000x the wall-clock at scale.
  3. Pushdown lets the ANN index restrict the search graph traversal to nodes that match the filter — modern vector stores (Pinecone, Qdrant, Weaviate, pgvector with vchord / pgvecto.rs) support this natively.
  4. Worst case for post-filter: the user has a "needle in haystack" tenant where the top 2000 ANN candidates contain zero of the tenant's chunks → user gets empty results despite having matching chunks in the index.
  5. Pushdown is also the only safe form — post-filter is a defence-in-depth layer at best, not the primary tenant boundary.

Output.

Approach Avg fetched P95 latency tenant safety
post-filter 2000-10000 400-2000 ms weak
pushdown k (5) 15-50 ms strong

Rule of thumb. Always push tenant and ACL filters down into the vector store; never rely on application-side post-filter. The performance difference is 1-2 orders of magnitude, and the safety boundary lives in exactly one place — the vector store query itself.

Worked example — cross-encoder reranker on top of hybrid

Detailed explanation. A cross-encoder reranker is the standard precision multiplier on top of hybrid retrieval. The shape is fixed: hybrid produces 50 candidates → reranker scores all 50 in one batch → top-5 are the prompt context. Adds ~50-100ms; lifts top-5 precision dramatically.

Question. Add a Cohere-style reranker to the hybrid pipeline. Show the batch call shape, the latency budget, and what happens when the reranker times out.

Input. A query plus 50 hybrid-fused candidate chunks.

Code.

def hybrid_then_rerank(query: str, tenant_id: str, k: int = 5) -> list:
    # 1) Dense + sparse → fuse → 50 candidates
    candidates = hybrid_search(query, tenant_id, top_k=50)
    texts = [text_store.get(c.id) for c in candidates]

    # 2) Cross-encoder batch — single call with all 50 pairs
    try:
        scores = reranker.score_batch(
            query=query, documents=texts, timeout_ms=120
        )
    except TimeoutError:
        # 3) Graceful degradation — return top-k from hybrid without rerank
        log.warn("reranker_timeout", query_hash=hash(query))
        metric("rag.rerank_timeout", 1)
        return candidates[:k]

    # 4) Sort by reranker score; return top-k
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:k]]
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The reranker call is batched — all 50 (query, doc) pairs go in one HTTP request. Per-pair calls would multiply latency by ~50.
  2. Latency budget: 120ms timeout. Cohere rerank-v3 typically lands at 60-100ms for 50 pairs; budget the timeout above the P99 to avoid spurious timeouts.
  3. Graceful degradation: if the reranker times out, fall back to the hybrid top-k. Never fail the whole query — degrade to "less precise but still answered."
  4. Metric rag.rerank_timeout is graphed and alerted on. A sustained timeout rate (≥5%) is the first sign the reranker is overloaded or the model API is unhealthy.
  5. The reranker is the only online step that can be hot-swapped. Promote a new reranker behind a feature flag, run it shadow-style for a week against the golden set, then flip.

Output.

step latency budget typical actual
query embed 30 ms 10-25 ms
ANN search (filtered) 40 ms 15-30 ms
BM25 search 30 ms 10-20 ms
RRF fuse 5 ms <2 ms
reranker batch (50) 120 ms 60-100 ms
prompt assemble 20 ms 5-15 ms
total < 300 ms 100-200 ms

Rule of thumb. Budget the reranker on its own line. Cap candidate count at 50 (sometimes 100 for very high-stakes queries). Always wrap the call with a timeout and a graceful-degradation fallback to "hybrid without rerank" — never fail the user-facing query because the reranker hiccupped.

Data engineering interview question on storage schema and hybrid retrieval

A senior interviewer might frame this as: "You are designing the vector store schema for a multi-tenant retrieval augmented generation service. Walk me through the schema, why each column exists, and how the online retrieval query uses every column."

Solution Using a metadata-rich schema and pushdown filters

# Vector store row schema — every column has a job
SCHEMA = {
    "id": "text PRIMARY KEY",         # deterministic chunk_id
    "vector": "vector(768)",          # embedding
    "tenant_id": "text NOT NULL",     # multi-tenant pushdown
    "source": "text NOT NULL",        # attribution
    "document_id": "text NOT NULL",   # parent doc grouping
    "chunk_idx": "int NOT NULL",      # sibling lookup
    "content_type": "text",           # prose / code / table / transcript
    "last_modified": "timestamptz",   # freshness SLO + recency filter
    "acl_ids": "text[]",              # permission tags
    "embed_model_id": "text",         # blue/green safety pin
    "content_hash": "text",           # skip-unchanged + dedupe
}

# The retrieval query — every metadata column is exercised
def retrieve(query, tenant_id, user_acl, max_age_days):
    qvec = embed_model.encode([query])[0]
    return vector_store.search(
        vector=qvec,
        top_k=50,
        filter={
            "tenant_id": tenant_id,
            "acl_ids": {"$overlap": user_acl},
            "last_modified": {"$gte": days_ago(max_age_days)},
            "embed_model_id": embed_model.id,
        },
    )
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Column Used by
id upsert idempotency, sidecar text lookup
vector ANN cosine similarity
tenant_id filter pushdown — multi-tenant safety
source prompt attribution, trust signal
document_id parent-chunk resolution, "show full doc"
chunk_idx sibling lookup ("read the next chunk too")
content_type content-aware rerank or rules
last_modified freshness SLO, recency filter
acl_ids permission pushdown (overlap match)
embed_model_id blue/green safety — match read model
content_hash skip-unchanged on reingest, dedupe

The trace highlights that every metadata column earns its keep — none are decorative. Drop tenant_id and you have a cross-tenant leakage bug; drop embed_model_id and you have a silent model-mismatch bug; drop acl_ids and you have an authz failure.

Output:

Decision Rationale
768 dims balance recall vs storage
pgvector or Qdrant pushdown filter support
ACL as array overlap-match against user's permissions
content_hash indexed re-ingest skips unchanged chunks
last_modified indexed freshness queries are common

Why this works — concept by concept:

  • Metadata pushdown is the single biggest perf lever — restricting the ANN search to the tenant's slice changes the constants by 1-2 orders of magnitude at scale.
  • embed_model_id is the silent-bug pin — the most subtle RAG failure mode is read and write paths using different embedding models; the metadata column makes the mismatch detectable in seconds.
  • content_hash is the skip-unchanged pin — letting the ingest job re-embed only changed chunks turns nightly reindex from a multi-hour batch into a sub-minute incremental.
  • ACL overlap match — modern vector stores support array-overlap as a filter primitive. Permission pushdown then runs at the same speed as tenant pushdown.
  • Cost — schema overhead is ~200 bytes per chunk for metadata; the index on tenant_id and last_modified is the difference between a 200ms and a 20ms query at 50M chunks. Every byte earns its keep.

DE
Topic — indexing
Indexing problems (DE)

Practice →


5. Freshness, reindex, and ACLs

rag freshness is an SLO, not a feature — and the reindex playbook is what keeps it honest

The mental model in one line: freshness is the P95 lag between a source document being updated and the corresponding chunk being retrievable — and a production RAG pipeline either ships an explicit freshness SLO with telemetry or silently drifts into "the bot still cites the old policy." Once you state the SLO, every reindex strategy maps to "how do I keep this lag under X."

Freshness and reindex diagram — left zone shows a source CDC stream feeding an embed worker, middle zone shows an upsert into a 'current collection' card with a tombstone tag flowing through, right zone shows a blue-green collection swap card for embedding-model upgrades; a P95 freshness SLO ribbon spans the top of the diagram, on a light PipeCode card.

The freshness SLO.

  • Definition. P95 source-to-retrieval lag — the time between a source-of-truth update (Confluence save, Postgres commit, S3 put) and the moment the new chunk is retrievable in the vector store.
  • Typical thresholds. Knowledge base reference docs: 30-60 minutes. Live policy / pricing lookup: 1-5 minutes. Real-time support context: <30 seconds.
  • Why SLO not feature. A feature is "we reindex nightly." An SLO is "we promise P95 < 5 minutes and we will page if it breaks." Only the SLO survives contact with stakeholders.
  • Telemetry. now - max(last_modified) per source for the embed worker; query-side now - retrieved_chunk.last_modified. Graph both; alert on either.

Incremental reindex via CDC.

  • Source CDC. Postgres logical replication (Debezium), Confluence webhooks, Notion webhooks, S3 event notifications. The source emits "what changed" events into a topic (Kafka, Kinesis).
  • Embed worker. Consumes the topic, fetches the changed document, re-chunks, re-embeds only changed chunks (skip-unchanged via content_hash), upserts vectors and metadata.
  • Upsert semantics. Deterministic chunk_ids mean upsert is idempotent — the same change replayed produces the same final state.
  • Backfill mode. A new source connector starts in "full backfill" (read every doc once) and graduates to "CDC tail" (read changes only).

Tombstoning deletes.

  • The problem. A document deleted in the source must vanish from retrieval — but most teams only mark the source as deleted and forget the vector store.
  • The fix. When CDC emits a "delete" event, the embed worker either hard-deletes the chunk_ids for that document, or soft-deletes them by setting is_active=false in the metadata sidecar (and filters is_active=true on every retrieve).
  • Why soft-delete + nightly purge. Soft-delete is reversible (whoops, didn't mean to nuke that doc); nightly purge sweeps hard-deletes after a grace window.
  • Test. Every nightly eval golden set includes a "deleted doc must not appear" canary. If it appears, page.

Embedding model upgrade = full re-embed.

  • The rule. The vectors in the index were produced by model M. Switching to model M' invalidates every vector; queries embedded by M' do not match vectors from M.
  • The pattern. Blue/green collections. Create collection_v2 with the new model, re-embed everything, dual-write during transition, cut over reads atomically, then drop collection_v1.
  • Cost. A full re-embed of 100M chunks at $0.02/M tokens × 500 tokens/chunk ≈ $1000 in API spend, plus the worker compute. Budget for it before promising the upgrade.
  • Validation. Run the golden-set eval against v2 before cutover. If recall@5 is not meaningfully better, the upgrade is not worth it.

Per-tenant ACL pushdown.

  • The shape. Each chunk carries acl_ids: ["public", "support", "engineering"]. Each user has user_acl: ["public", "support"]. The retrieval filter is acl_ids OVERLAP user_acl pushed down into the vector store.
  • Why pushdown not post-filter. Same reason as tenant_id — post-filter forces huge over-fetch; pushdown is one extra predicate in the ANN traversal.
  • Sourcing the ACL list. Materialised at ingest time from the source's ACL system (Confluence space permissions, Notion page permissions, S3 bucket policy tags). Refreshed via CDC when permissions change.
  • Permission change is itself a CDC event. A user joins a new team → their user_acl updates → next query sees the wider chunk set. No reindex needed.

Quality regression detection.

  • Golden set drift. Recall@5 / MRR@10 trended over time. A sudden drop after an ingest job means a recent change broke retrieval.
  • Top-cause attribution. Diff the recent ingest config (chunker change, embedder change, schema change) against the last "green" deploy.
  • Auto-rollback. Some teams configure auto-rollback on a sustained ≥5pt recall drop — return the index to its previous state until a human investigates.

Common interview probes on freshness and reindex.

  • "What is your freshness SLO and how do you measure it?" — name a P95 number, name a telemetry source, name the alert.
  • "How do you handle deletes?" — soft-delete in metadata + filter on retrieve + nightly hard-purge sweep.
  • "How do you upgrade an embedding model in production?" — blue/green collections, dual-write, golden-set validation, atomic cutover.
  • "How do you enforce per-user permissions?" — acl_ids array column on every chunk; overlap filter pushed down into the vector store; permission changes flow through the same CDC stream.

Worked example — CDC-driven incremental reindex worker

Detailed explanation. A streaming embed worker consumes a CDC topic, fetches the changed document, re-chunks, re-embeds only changed chunks (skip-unchanged via content_hash), and upserts. Sub-minute lag at steady state.

Question. Sketch the embed worker that consumes a source.changes Kafka topic and upserts vectors with sub-5-minute P95 lag.

Input. A stream of CDC events.

event doc_id op last_modified
1 PAGE-42 upsert 2026-06-15T10:00:00Z
2 PAGE-09 delete 2026-06-15T10:00:05Z
3 PAGE-42 upsert 2026-06-15T10:00:30Z

Code.

def embed_worker_loop():
    consumer = kafka.Consumer("source.changes", group_id="rag-embed-worker")
    for event in consumer:
        try:
            if event["op"] == "delete":
                handle_delete(event["doc_id"])
            else:
                handle_upsert(event["doc_id"])
            consumer.commit(event)
            metric("rag.source_lag_seconds", now() - event["source_ts"])
        except Exception as e:
            log.error("embed_worker_error", error=str(e), event=event)
            consumer.send_to_dlq(event)


def handle_upsert(doc_id: str) -> None:
    doc = source.fetch(doc_id)                  # idempotent fetch by id
    new_chunks = chunk_dispatch(doc)            # per-content-type chunker
    existing = vector_store.scan(filter={"document_id": doc_id})
    existing_hashes = {c.metadata["content_hash"] for c in existing}

    to_upsert = []
    for new in new_chunks:
        h = hashlib.sha256(new["text"].encode()).hexdigest()
        if h in existing_hashes:
            continue                            # skip unchanged
        new["metadata"]["content_hash"] = h
        to_upsert.append(new)

    if to_upsert:
        vectors = embed_model.encode([c["text"] for c in to_upsert])
        for c, v in zip(to_upsert, vectors):
            c["vector"] = v
        vector_store.upsert(to_upsert)


def handle_delete(doc_id: str) -> None:
    # Soft-delete — mark as inactive; nightly purge sweeps hard deletes
    vector_store.update_metadata(
        filter={"document_id": doc_id},
        patch={"is_active": False, "deleted_at": now()},
    )
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The worker consumes one event at a time, processes it, and commits the offset only after the upsert succeeds. At-least-once delivery semantics are fine because chunk_ids are deterministic and upsert is idempotent.
  2. On an upsert event, the worker fetches the full document from the source (CDC events typically carry only IDs, not bodies), re-chunks, and computes the content hash of each chunk.
  3. The skip-unchanged optimisation: chunks whose hash matches a previously stored one are not re-embedded. For a single-paragraph edit on a 50-paragraph doc, only 1-2 chunks change — the savings compound.
  4. On a delete event, the worker soft-deletes by setting is_active=false on every chunk for that doc. The retrieval path filters is_active=true, so deleted chunks are invisible immediately.
  5. A nightly purge job hard-deletes rows where deleted_at < now() - 7 days, freeing index space.
  6. metric("rag.source_lag_seconds", now() - event.source_ts) is the per-event freshness measurement. Aggregated as P95 across an hour, this is the freshness SLO.
  7. Errors go to a DLQ (dead-letter queue) so a single bad doc cannot block the whole stream. The DLQ has its own alert.

Output.

event action chunks re-embedded lag
1 (PAGE-42 upsert) fetch + chunk + embed 3 of 12 changed ~45 s
2 (PAGE-09 delete) soft-delete 0 (metadata only) ~5 s
3 (PAGE-42 upsert) fetch + chunk + embed 1 of 12 changed ~30 s

Rule of thumb. The skip-unchanged + soft-delete + DLQ trio is the production shape of a CDC embed worker. Without them you re-embed everything on every change (cost explodes), you hard-delete on every event (no recovery), or you crash on the first malformed doc (whole stream stalls).

Worked example — blue/green embedding model upgrade

Detailed explanation. A model upgrade invalidates every vector in the index. The safe pattern is blue/green: build a parallel v2 collection with the new model, dual-write incoming CDC events into both, validate against the golden set, cut over reads atomically, drop the old collection.

Question. Sketch the blue/green upgrade flow from embedding model e5-large-v1 to e5-large-v2 for a 50M-chunk index.

Input. Existing collection_v1 with all chunks embedded by e5-large-v1.

Code.

# Phase 1 — Create v2 collection and full re-embed (background)
def backfill_v2():
    create_collection("collection_v2", dims=v2_model.dims)
    for batch in scan_chunks("collection_v1", batch_size=1000):
        texts = [text_store.get(c.id) for c in batch]
        vectors = v2_model.encode(texts)
        rows = []
        for c, v in zip(batch, vectors):
            rows.append({
                **c.as_dict(),
                "vector": v,
                "metadata": {**c.metadata, "embed_model_id": "e5-large-v2"},
            })
        upsert("collection_v2", rows)


# Phase 2 — Dual-write incoming CDC into both collections
def handle_upsert_dualwrite(doc_id: str):
    doc = source.fetch(doc_id)
    new_chunks = chunk_dispatch(doc)
    v1_vecs = v1_model.encode([c["text"] for c in new_chunks])
    v2_vecs = v2_model.encode([c["text"] for c in new_chunks])
    for c, v1, v2 in zip(new_chunks, v1_vecs, v2_vecs):
        upsert("collection_v1", [{**c, "vector": v1}])
        upsert("collection_v2", [{**c, "vector": v2}])


# Phase 3 — Validate v2 against the golden set
def validate_v2() -> bool:
    v1_metrics = score_pipeline(golden, collection="collection_v1")
    v2_metrics = score_pipeline(golden, collection="collection_v2")
    return v2_metrics["recall_at_k"] >= v1_metrics["recall_at_k"] - 0.005


# Phase 4 — Atomic read cutover via feature flag
def retrieve(query, tenant_id):
    collection = "collection_v2" if FLAGS["read_v2"] else "collection_v1"
    model = v2_model if FLAGS["read_v2"] else v1_model
    qvec = model.encode([query])[0]
    return search(collection, qvec, filter={"tenant_id": tenant_id})


# Phase 5 — Drop v1 after a 7-day soak
def cleanup_v1():
    if FLAGS["read_v2"] and days_since_cutover() >= 7:
        drop_collection("collection_v1")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Phase 1 (backfill): a background job iterates every chunk in v1, re-embeds with the new model, writes into v2. Heavy compute and API spend; runs over hours or days for a large index.
  2. Phase 2 (dual-write): from the moment the backfill starts, every CDC event is written into both collections. This keeps v2 in sync with new edits while the backfill is still running.
  3. Phase 3 (validate): run the golden-set eval against both collections. v2 must match or beat v1 on recall@5 / MRR@10. A regression is a stop sign.
  4. Phase 4 (cutover): flip a feature flag. The read path switches from v1 to v2 and from v1_model to v2_model. The flip is instant and reversible.
  5. Phase 5 (cleanup): after a 7-day soak (during which you can flip back if something breaks), drop v1 and reclaim storage.
  6. Throughout: dual-write costs 2x the embed compute. Plan the rollout so the cost window is bounded — typically a few days, not weeks.

Output.

phase duration risk
backfill hours-days high cost, no user impact
dual-write duration of rollout 2x embed cost
validate 1-3 days catches regressions
cutover instant reversible via flag
cleanup after 7d soak irreversible

Rule of thumb. Never hot-swap an embedding model. Always go blue/green: backfill → dual-write → validate → atomic cutover → soak → drop old. The 2x embed cost during rollout is the price you pay for zero downtime and instant rollback.

Worked example — per-tenant ACL pushdown end-to-end

Detailed explanation. Multi-tenant RAG with per-document permissions means every chunk carries an acl_ids array and every query carries the user's user_acl. The retrieval predicate is acl_ids OVERLAP user_acl, pushed down into the vector store.

Question. Show the ingest-side ACL materialisation and the retrieval-side overlap filter for a Confluence-backed RAG with space-level permissions.

Input. A Confluence page with space ACLs.

field value
page_id "PAGE-42"
space_id "ENG"
space_acl ["eng-team", "leadership"]

Code.

# Ingest — materialise ACL on every chunk
def ingest_with_acl(page):
    acl_ids = space_acl_resolver.resolve(page["space_id"])  # ["eng-team", "leadership"]
    chunks = chunk_dispatch(page)
    for c in chunks:
        c["metadata"]["acl_ids"] = acl_ids
        c["metadata"]["space_id"] = page["space_id"]
    vectors = embed_model.encode([c["text"] for c in chunks])
    for c, v in zip(chunks, vectors):
        c["vector"] = v
    vector_store.upsert(chunks)


# Retrieve — overlap filter pushed down
def retrieve_for_user(query, user_id, tenant_id):
    user_acl = user_directory.get_acl(user_id)  # e.g. ["eng-team", "all-hands"]
    qvec = embed_model.encode([query])[0]
    return vector_store.search(
        vector=qvec,
        top_k=50,
        filter={
            "tenant_id": tenant_id,
            "acl_ids": {"$overlap": user_acl},   # array overlap
            "is_active": True,
        },
    )


# Permission change — same CDC stream
def on_user_acl_change(user_id):
    # No reindex needed — next query picks up the new ACL
    user_directory.invalidate(user_id)

def on_space_acl_change(space_id):
    # Re-materialise ACL on every chunk in this space
    new_acl = space_acl_resolver.resolve(space_id)
    vector_store.update_metadata(
        filter={"space_id": space_id},
        patch={"acl_ids": new_acl},
    )
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. At ingest, the chunker gets the space's ACL list and writes it into every chunk's metadata. The space ID is also stored so the chunks can be updated atomically if the space ACL changes.
  2. The retrieval query pushes acl_ids OVERLAP user_acl down into the vector store. The user sees only chunks whose ACL intersects their permission set.
  3. User permission change (joins a new team) does not require reindex — the user's user_acl updates in the directory, next query reflects it.
  4. Space permission change (a previously-private space becomes shared with another team) does require updating every chunk in that space — but the update is a metadata patch, not a re-embed. Sub-second on most vector stores.
  5. The overlap filter is one extra predicate in the ANN search, so the perf cost is negligible compared to the tenant filter.
  6. Combined with tenant_id and is_active, the retrieval contract is "tenant-isolated, ACL-filtered, soft-delete-aware." Three predicates, all pushed down, all enforced in the vector store itself.

Output.

user user_acl accessible chunks
alice (eng) ["eng-team", "all-hands"] PAGE-42 (eng-team) + public
bob (sales) ["sales", "all-hands"] not PAGE-42
ceo ["leadership", "all-hands"] PAGE-42 (leadership) + public

Rule of thumb. ACL pushdown belongs in the same query as the tenant filter — both filters live or die together in the vector store. Application-side post-filter for permissions is an authz time bomb; never ship it as the primary boundary.

Data engineering interview question on freshness and tombstoning

A senior interviewer might frame this as: "Stakeholder reports the bot is still citing a deleted policy doc. How do you design the pipeline so this cannot happen, and what is the SLO you commit to?"

Solution Using soft-delete + freshness SLO + golden-set canary

# 1) Soft-delete on CDC delete events
def handle_delete(doc_id):
    vector_store.update_metadata(
        filter={"document_id": doc_id},
        patch={"is_active": False, "deleted_at": now()},
    )

# 2) Retrieve filters out inactive chunks
def retrieve(query, tenant_id):
    return vector_store.search(
        vector=embed_model.encode([query])[0],
        top_k=50,
        filter={"tenant_id": tenant_id, "is_active": True},
    )

# 3) Nightly purge sweeps hard deletes
def nightly_purge():
    vector_store.delete(filter={
        "is_active": False,
        "deleted_at": {"$lt": days_ago(7)},
    })

# 4) Golden-set canary catches regressions
DELETED_DOC_CANARY = {
    "question": "What was the old refund window?",
    "must_not_match_source": "confluence/PAGE-deleted-99",
}

def canary_check():
    hits = retrieve(DELETED_DOC_CANARY["question"], tenant_id="default")
    leaked = any(h.metadata["source"] == DELETED_DOC_CANARY["must_not_match_source"]
                 for h in hits)
    if leaked:
        page_oncall("deleted_doc_leak", source=DELETED_DOC_CANARY)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step What it does SLO
1 soft-delete metadata is_active=false on CDC delete P95 < 30 s
2 retrieve filter only is_active=true chunks returned always
3 nightly purge hard-delete after 7-day soak nightly
4 canary golden-set probe for known-deleted docs every nightly eval

The trace highlights that deletion safety is the conjunction of four steps, not any one of them — the soft-delete handles the instant invisibility, the retrieve filter enforces it, the nightly purge reclaims storage, and the canary catches the day someone breaks step 2.

Output:

Failure mode Catch step
CDC delete event missed step 4 (canary in next eval)
Soft-delete not yet propagated step 1 (30s SLO)
Retrieve filter dropped step 4 (canary fires)
Hard purge premature 7-day soak window

Why this works — concept by concept:

  • Soft-delete first, hard-delete later — gives a reversible window. Whoops-deletes recover in seconds; intentional deletes flush after 7 days.
  • Retrieve-side filter on is_active — the safety boundary lives in the query, not in the deletion handler. Even if the delete event is replayed or lost, the next query still respects is_active.
  • Golden-set canary — a known-deleted doc in the golden set is a continuous integration test for deletion. If the bot ever cites it, the eval fails before the user sees it.
  • Freshness SLO ties the loop — the P95 source-to-retrieval lag SLO covers both adds and deletes; one number, alerted on, owned by DE.
  • Cost — soft-delete is one metadata write per deleted chunk; canary is one extra row in the golden set; purge is a nightly bulk delete. All bounded; none cost more than a few minutes a day.

DE
Topic — data validation
Data validation problems (DE)

Practice →


Cheat sheet — RAG pipeline recipes

  • Chunk size starter. 500 tokens with 75-token overlap (15%). Tune against the golden set before changing.
  • Per-content-type strategy. Prose → recursive; code → AST; tables → atomic; transcripts → semantic; long-form policy → hierarchical (sentence child, paragraph parent).
  • Hybrid score fusion. Reciprocal Rank Fusion (RRF) with k=60 — robust, no tuning. Weighted 0.6 * dense + 0.4 * BM25 only if you have a held-out tuning set.
  • Metadata filter on retrieve. WHERE tenant_id = $1 AND acl_ids OVERLAP $2 AND is_active = true — pushed down into the vector store, never post-filter.
  • Reranker shape. Top-50 ANN candidates → cross-encoder batch → top-5 prompt context. Wrap with 120ms timeout and graceful degradation.
  • Embed model rule. Store embed_model_id in every chunk's metadata; assert it matches the query embedder at retrieval time. Catches the silent model-mismatch bug.
  • Freshness pipeline. Source CDC → Kafka → embed worker → vector upsert. P95 source-to-retrieval lag SLO < 5 minutes (tune to use case).
  • Skip-unchanged. Hash chunk text with SHA-256; store in metadata; re-embed only when hash changes. Turns nightly reindex into sub-minute incremental.
  • Tombstone delete. Soft-delete via is_active=false in metadata; filter on retrieve; nightly purge sweep hard-deletes after 7 days.
  • Embedding model upgrade. Blue/green collections — backfill → dual-write → golden-set validate → atomic cutover → 7-day soak → drop old. Never hot-swap.
  • ACL pushdown. acl_ids array on every chunk; acl_ids OVERLAP user_acl filter pushed down. Permission changes update the user directory, not the chunks (except space-level changes).
  • Golden set. 200-2000 (question, expected_chunk_id) triples maintained by SMEs. Nightly recall@5 and MRR@10 with alerts on a sustained 5pt drop.
  • Deletion canary. A known-deleted doc in the golden set whose source must not appear in retrieval. Continuous integration for deletion.
  • DLQ on the embed worker. Errors do not block the stream; a malformed doc goes to a DLQ with its own alert.
  • Telemetry. Source lag, embed queue depth, ANN P95, reranker P95, fallback ratio, recall@5. Six numbers, six graphs.

Frequently asked questions

What chunk size should I start with for a new RAG pipeline?

Start at 500 tokens with 75-token overlap (15%). It is the empirically calibrated default that lands inside most embedding model context windows (most are 512 tokens) and leaves enough room for the LLM to receive 5-10 chunks per prompt. Tune from there against your golden set — long-form policy docs often benefit from hierarchical chunking (120-token children, 600-token parents); transcripts benefit from semantic chunking with similarity threshold 0.55; code should be split by AST function/class boundaries with no overlap. The cheat-sheet recipe to memorise: "500/75 default, per-content-type adapters, golden-set tuning."

Do I need hybrid search or is dense retrieval enough?

You almost always need hybrid search in production. Pure dense retrieval misses on rare keywords (error codes, product SKUs, proper nouns the embedder has never seen), which is where 10-15 points of recall@5 hide. Pure BM25 misses on semantic synonyms ("revenue" vs "income"). Fusing them with Reciprocal Rank Fusion (RRF, k=60) recovers both regimes without per-system weight tuning. Skip hybrid only if (a) your content is all natural-language prose with no rare-vocab terms and (b) your golden-set recall is already at the threshold you need. In every other case, hybrid is the 2026 default for hybrid search.

How do I keep my RAG index fresh?

State a freshness SLO (e.g. P95 source-to-retrieval lag < 5 minutes), implement a CDC pipeline (Postgres logical replication, Confluence webhooks, Notion webhooks, S3 events) that emits change events into Kafka, and run an embed worker that consumes the stream, re-chunks the changed document, and upserts the new vectors with deterministic chunk_ids. Use content_hash to skip-unchanged chunks so a one-paragraph edit only re-embeds one chunk. For deletes, soft-delete via is_active=false in metadata and filter on retrieve; nightly purge sweeps hard-deletes after a 7-day soak. Graph now - max(last_modified) per source as the SLO telemetry; alert on breaches.

When do I need a reranker?

You need a reranker as soon as top-5 precision matters — which in practice is "as soon as your bot ships to real users." The standard shape is hybrid retrieves 50 candidates, the cross-encoder reranks all 50 in one batch, top-5 go into the prompt. Reranker adds 50-100ms; lifts recall@1 and MRR significantly because cross-encoders apply full attention across the query and the chunk text, not just a dot product. Cohere rerank-v3 and BGE-reranker are the common hosted/OSS picks. Always wrap the call with a timeout and a graceful-degradation fallback to "hybrid top-k without rerank" — never fail the user-facing query because the reranker hiccupped.

How do I enforce per-user permissions in RAG?

Store an acl_ids array on every chunk at ingest time (materialised from the source's permission system — Confluence space ACL, Notion page permissions, S3 bucket tags), and store each user's user_acl in a directory service. At retrieval time, push a acl_ids OVERLAP user_acl predicate down into the vector store alongside the tenant_id filter. Never apply ACL as an application-side post-filter — the perf cost is huge (massive over-fetch needed) and the safety story is weaker because the boundary lives in two places. User permission changes update the directory only (no reindex needed); space-level permission changes patch the acl_ids metadata on every chunk in that space (sub-second on most vector stores).

How do I evaluate RAG quality before shipping?

Build a golden set of 200-2000 (question, expected_chunk_id, expected_answer) triples with subject-matter experts. Run a nightly eval that fires every question through the live retrieval pipeline (same embed model, same filter, same rerank) and scores recall@5 (was the gold chunk in the top-5?) and MRR@10 (1/rank, averaged). Threshold recall@5 ≥ 0.85 as a typical production gate; alert on a sustained 5-point drop. Add a "deleted doc canary" to the golden set so deletion regressions surface. For online quality, sample live queries and score them with an LLM-as-judge against a rubric — useful for drift detection but not as ground truth. Without a golden set, every RAG change is a vibe-based decision.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every RAG pipeline recipe above ships with hands-on practice rooms where you write the CDC consumer, the chunker, the hybrid retrieval query, and the golden-set eval harness against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `rag data pipeline` will behave the same in production as it did on the whiteboard.

Practice ETL pipelines now →
Streaming drills →

Top comments (0)