Why “Just Prompting” Fails on Private Data: A RAG Post‑Mortem

The Problem

You have a 400‑page internal handbook that covers compliance rules, HR policies, and engineering runbooks. You ask an LLM: “What’s the approval chain for a budget over $50k?”

Without RAG, the model hallucinates: “The VP of Finance and the CTO must both approve.” But your real policy says: “Only the CFO for >$50k, plus a board note if >$200k.”

The core problem: LLMs are frozen at training time. They don’t know your private documents. Fine‑tuning is expensive, lags behind updates, and still suffers from parametric knowledge bleed. RAG solves the specific problem of grounding generation in fresh, proprietary, or long‑tail facts without retraining.

But naïve RAG (chunk → embed → retrieve → stuff into prompt) breaks in surprising ways. This article walks through one real failure, three common failure modes, and the guardrails we built to make RAG production‑ready.


The Dry‑Run: Answering an Employee’s Parental Leave Question

Scenario: An employee asks a Slack bot: “How many weeks of paid parental leave do I get, and do I need to notify HR before birth?”

The source is a 50‑page PDF, Parental Leave Policy v4.2, last updated 3 months ago.


Step 1 – Chunking

We split the PDF into overlapping chunks of 512 tokens (with 128‑token overlap).

Why? Without overlap, a sentence like “The leave period is 12 weeks. However, for birth mothers, an additional 4 weeks of medical recovery applies.” might split right after “12 weeks”, losing the exception.
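
A minimal chunking sketch, assuming tiktoken for token counting (any tokenizer would do); a production chunker would usually also respect sentence and section boundaries:

```python
# Minimal sketch of overlapping token-window chunking.
# Assumes tiktoken; the 512/128 numbers match the pipeline above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    tokens = enc.encode(text)
    step = chunk_size - overlap  # advance 384 tokens per window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail
    return chunks
```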

Step 2 – Embedding & Indexing

Each chunk is passed through text-embedding-3-small (1536 dimensions). We store vectors in a pgvector index together with metadata (page number, section title, last update date).
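
A sketch of this step. The hr_chunks table and its columns are hypothetical names; it assumes the openai Python client, psycopg, and a Postgres instance with the pgvector extension enabled:

```python
# Indexing sketch: embed each chunk and store it with its metadata.
# hr_chunks and its columns are hypothetical, not from the article.
from openai import OpenAI
import psycopg

client = OpenAI()

def index_chunk(conn, chunk: str, page: int, section: str, updated: str):
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=chunk
    ).data[0].embedding  # 1536-dimensional vector
    conn.execute(
        "INSERT INTO hr_chunks (content, embedding, page, section, updated) "
        "VALUES (%s, %s::vector, %s, %s, %s)",
        (chunk, str(emb), page, section, updated),
    )
```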

Step 3 – Query Embedding

User query: “paid parental leave weeks + HR notification before birth?”

We embed the query. Note: we deliberately do not use a separate rewriter; the raw query goes to the retriever.

Step 4 – Retrieval

Vector similarity (cosine) returns the top‑5 chunks (a query sketch follows the list). Example chunks retrieved:

  1. “Eligible employees receive 12 weeks of fully paid parental leave.” (score 0.92)
  2. “Birth mothers may take an additional 4 weeks of paid medical recovery leave, distinct from parental leave.” (score 0.89)
  3. “Notification: Employee must submit a leave request in Workday at least 30 days before the expected birth date.” (score 0.87)
  4. “Adoptive parents receive the same 12 weeks but no medical recovery weeks.” (score 0.76)
  5. “Leave can be taken intermittently with manager approval.” (score 0.68)
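
Steps 3 and 4, sketched against the hypothetical hr_chunks table above. pgvector’s <=> operator is cosine distance, so similarity is 1 minus that:

```python
# Retrieval sketch: embed the raw query, then take the 5 nearest chunks.
def retrieve(conn, query: str, k: int = 5):
    q = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    return conn.execute(
        "SELECT content, page, 1 - (embedding <=> %s::vector) AS score "
        "FROM hr_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(q), str(q), k),
    ).fetchall()
```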

Step 5 – Generation Prompt

We assemble a prompt:

```
You are an HR assistant. Use ONLY the following context to answer the question.
If the answer is not in the context, say "I don't know."

Context:
[chunk1] [chunk2] [chunk3] [chunk4] [chunk5]

Question: How many weeks of paid parental leave do I get, and do I need to notify HR before birth?

Answer in a clear, bulleted list.
```

Step 6 – LLM Response

The model correctly outputs:

  • 12 weeks of fully paid parental leave for all eligible employees.
  • Birth mothers get an additional 4 weeks of paid medical recovery leave.
  • You must notify HR via Workday at least 30 days before the expected birth date.

Success – no hallucination about a “CTO approval”.


Failure Modes (Where RAG Secretly Fails)

Even with the above, we see three catastrophic failure patterns in production.

Failure 1 – The “Lost in the Middle” Problem

Our top‑5 chunks are concatenated into a single context. The LLM attends most strongly to the first and last chunks, while the ones in the middle (e.g., the notification rule) are effectively ignored.

Consequence: The bot answers the weeks question but omits the 30‑day notification rule. Employee misses the deadline.

Failure 2 – Low‑Relevance Retrieval (But High Cosine Score)

Consider the query: “What happens if I return to work part‑time after leave?”

Vector search returns the chunk “Intermittent leave requires manager approval” (cosine 0.81). The chunk that actually answers the question, “Returning part‑time is not allowed during the first 12 weeks,” scores only 0.52 because it uses different wording (“reduced schedule” vs. “part‑time”).

Consequence: The model says “manager can approve” – wrong and harmful.

Failure 3 – Contradictory Chunks

Two chunks in the same document:

  • Chunk A: “You may use PTO during parental leave to top up pay.” (old version)
  • Chunk B: “As of Jan 2025, PTO cannot be used to top up parental leave pay.”

The retriever returns both. The LLM picks one at random, or hallucinates a compromise.

Consequence: Inconsistent answers depending on chunk order.


Guardrails (Engineering Fixes for Each Failure)

We implemented five explicit guardrails on top of the basic RAG pipeline.

Guardrail 1 – Reranking with Cross‑Encoder

After vector retrieval, we take the top‑20 chunks and rerank them with a cross‑encoder (cross-encoder/ms-marco-MiniLM-L-6-v2). Unlike the bi‑encoder used for retrieval, a cross‑encoder scores each (query, chunk) pair jointly, so it catches wording mismatches that cosine similarity over pre‑computed embeddings smooths over.

Result: The “part‑time return” chunk scores 0.92 after reranking, while “intermittent leave” drops to 0.43. We keep only top‑3 reranked chunks.
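
A sketch of this reranking step with the sentence-transformers library; the helper name is ours, but the model and the keep‑3 cutoff match the text above:

```python
# Rerank the top-20 retrieved chunks with a cross-encoder, keep the top 3.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 3) -> list[tuple[str, float]]:
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]
```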

Guardrail 2 – Chunk Positioning Weighting

In the prompt, we present chunks as numbered sources. We append a sentence: “The middle sources are often the most detailed – do not skip them.”

We also use a metadata field chunk_position_in_document and instruct the LLM to cite at least two different positions.
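
A sketch of how that context block could be assembled; the helper and the tuple shape are illustrative, only the appended instruction and the chunk_position_in_document field come from the guardrail itself:

```python
# Number the sources, expose each chunk's document position, and append
# the anti-"lost in the middle" instruction.
def build_context(chunks: list[tuple[str, str]]) -> str:
    # chunks: (text, chunk_position_in_document) pairs
    lines = [
        f"[source {i + 1} | position: {position}] {text}"
        for i, (text, position) in enumerate(chunks)
    ]
    lines.append(
        "The middle sources are often the most detailed - do not skip them. "
        "Cite at least two different positions."
    )
    return "\n".join(lines)
```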

Guardrail 3 – Contradiction Detector

Before sending chunks to the LLM, we run a lightweight entailment model (roberta-large-mnli) to check for contradictions. If two chunks have CONTRADICTION score > 0.8, we include both but add a system instruction: “The following two sources contradict each other. Explain the discrepancy and default to the newer one based on document version.”
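
A sketch of the contradiction check; pairing every chunk against every other and the exact pipeline call are our assumptions, while the model and the 0.8 threshold are from the guardrail above:

```python
# Flag chunk pairs that roberta-large-mnli labels CONTRADICTION with
# probability above the threshold.
from itertools import combinations
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def find_contradictions(chunks: list[str], threshold: float = 0.8):
    flagged = []
    for a, b in combinations(chunks, 2):
        scores = {
            r["label"]: r["score"]
            for r in nli({"text": a, "text_pair": b}, top_k=None)
        }
        if scores.get("CONTRADICTION", 0.0) > threshold:
            flagged.append((a, b))
    return flagged
```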

Guardrail 4 – Forced Citation

We require the LLM to output citations like [src: page 12]. We parse the response. If any statement lacks a citation, we reject and retry with a stricter prompt.
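
A sketch of the validator loop. The [src: page 12] format is from the guardrail; the ask callable standing in for the LLM call, and the fail‑closed fallback, are hypothetical:

```python
# Reject any bulleted claim that lacks a [src: page N] citation, then
# retry once with a stricter prompt before failing closed.
import re

CITATION = re.compile(r"\[src: page \d+\]")

def is_grounded(answer: str) -> bool:
    claims = [line for line in answer.splitlines() if line.strip().startswith("•")]
    return bool(claims) and all(CITATION.search(line) for line in claims)

def answer_with_citations(ask, prompt: str, strict_prompt: str) -> str:
    for attempt_prompt in (prompt, strict_prompt):
        answer = ask(attempt_prompt)
        if is_grounded(answer):
            return answer
    return "I don't know."  # fail closed rather than return unverified claims
```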

Guardrail 5 – Hybrid Search

We augment vector search with BM25 keyword matching. For queries with rare terms (e.g., “Workday notification”), BM25 finds the exact phrase that embedding might smooth over. Final score = 0.6 * vector + 0.4 * BM25.
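
A sketch of the fusion step; it assumes both score sets are already normalized to [0, 1] per query, since raw BM25 and cosine scores live on different scales:

```python
# Weighted fusion of normalized vector and BM25 scores:
# final = 0.6 * vector + 0.4 * bm25, as in the guardrail above.
def fuse(vector_scores: dict[str, float],
         bm25_scores: dict[str, float],
         alpha: float = 0.6) -> dict[str, float]:
    doc_ids = set(vector_scores) | set(bm25_scores)
    return {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1 - alpha) * bm25_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
```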


Architecture Diagram

Explanation of the diagram:

  • User query is processed in parallel (embedding + BM25).
  • Vector + BM25 results are fused (not shown for simplicity, but it’s inside the DB step).
  • Reranker reduces to top‑5 most relevant chunks.
  • Contradiction detector adds metadata before prompting.
  • LLM generates, then citation validator enforces groundedness. Retry loop prevents hallucinated claims.

Conclusion

RAG is not “just glue code”. Without reranking, contradiction detection, and forced citations, your bot will confidently produce wrong answers from the same document. The guardrails above have reduced hallucination rate on our internal HR dataset from 23% to 4.7% (measured by human‑evaluated citation correctness).
