Vijaya Rajeev Bollu

# I Built a RAG System That Enforces Its Own Citations — And Blocks Its Own Merges

## The Problem With Most RAG Tutorials

Every RAG tutorial ends the same way.

You send a question, the LLM returns an answer, and you ship it. What the tutorial doesn't show you: that answer might be confidently fabricated. The LLM might be citing a source it invented. There's no way to know.

I spent three days debugging exactly this in a prototype. The system sounded authoritative. It was hallucinating chunk references that didn't exist in the retrieved context.

The fix wasn't a better prompt. The fix was building actual enforcement into the pipeline — and then automating quality measurement so metric regressions literally cannot ship.

This post walks through a production-grade RAG system that answers questions from documents with verifiable citations, built on hybrid search, cross-encoder re-ranking, and CI/CD quality gates.


## Architecture

The request flow looks like this:

```
POST /query
  → HybridRetriever.search()   # BM25 + vector via RRF
  → rerank()                   # Cohere cross-encoder, top 5 from 20
  → generate_answer()          # gpt-4o-mini + citation validation
  → QueryResponse              # cited answer or refusal
```

Three layers, each with a measurable reason to exist.


## Layer 1: Hybrid Retrieval (BM25 + Vector)

Pure vector search fails on exact terminology. If a document says "CO2 emissions" and the user asks "carbon dioxide output," a cosine similarity search might miss it. BM25 catches it because it matches exact tokens.

Reciprocal Rank Fusion (RRF) merges the two ranked lists:

```
score(doc) = 1/(k + rank_vector) + 1/(k + rank_bm25)
```

With k=60, this gives stable fusion without needing to weight or normalize the scores. The BM25 index is rebuilt in memory from ChromaDB on every query, so it always reflects the current collection state and needs no separate sync.

The tradeoff: rebuild latency. For large corpora this matters. For most document Q&A workloads, it's negligible.
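
A minimal sketch of the fusion step, assuming the two retrievers each return a ranked list of chunk IDs (the function name and interface here are illustrative, not the repo's exact code):

```python
# RRF over two ranked lists of chunk IDs, k=60. The real HybridRetriever fuses
# ChromaDB vector results with BM25 results; this just shows the scoring math.
def rrf_fuse(vector_ranked: list[str], bm25_ranked: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked 3rd by vectors but 1st by BM25 beats one that only vectors ranked 1st.
print(rrf_fuse(["a", "b", "c"], ["c", "d"]))  # ['c', 'a', 'b', 'd']
```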


## Layer 2: Cross-Encoder Re-Ranking

The retriever returns 20 candidates. The re-ranker returns the top 5.

Bi-encoders (used in vector search) embed query and document independently — fast but imprecise. Cross-encoders (Cohere rerank-english-v3.0) see the query and document together — slower, but significantly more accurate on relevance.

The pattern: retrieve broadly (20), re-rank precisely (top 5). You get the recall of broad retrieval with the precision of cross-encoding.
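
The re-ranking call itself is small. A sketch assuming Cohere's Python SDK (the response objects expose .index and .relevance_score in the versions I've used; verify against the SDK docs for yours):

```python
# Sketch of the re-ranking step with Cohere's cross-encoder.
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly.
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Keep only the top_n candidates, in relevance order.
    return [candidates[r.index] for r in response.results]
```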


## Layer 3: Citation Enforcement

Every chunk stored in ChromaDB gets a unique ID: chunk- followed by 8 hex characters. The LLM is instructed to cite these IDs inline:

```
Global warming is primarily driven by greenhouse gas emissions [chunk-1a2b3c4d].
```
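
How the IDs get minted doesn't matter much, as long as they're unique and match the format. One simple option (whether the repo uses uuid4 or a content hash is an assumption on my part):

```python
# One way to mint IDs in the chunk-xxxxxxxx format (8 hex chars).
import uuid

def make_chunk_id() -> str:
    return f"chunk-{uuid.uuid4().hex[:8]}"
```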

After generation, a regex extracts every cited ID from the answer. Each one is checked against the set of IDs that were actually passed into the prompt. If any cited ID doesn't exist in that set — hallucinated reference — the answer is replaced with a refusal.

```python
import re

# valid_chunk_ids: the hex suffixes of every chunk ID that was passed into the prompt
cited_ids = set(re.findall(r'\[chunk-([0-9a-f]{8})\]', answer))
hallucinated = cited_ids - valid_chunk_ids
if hallucinated:
    return REFUSAL_RESPONSE  # at least one cited chunk was never retrieved
```

This doesn't prevent the LLM from being wrong about the content of real chunks. But it prevents citation fabrication — a distinct and common failure mode.


## The CI/CD Quality Gate

This is the part most RAG tutorials skip entirely.

Ragas measures two metrics:

  • Faithfulness: Is the answer supported by the retrieved context?
  • Context precision@5: Are the retrieved chunks actually relevant to the question?
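
Scoring those two metrics with Ragas looks roughly like this. I'm assuming the classic evaluate() interface here; exact imports and column names shift between Ragas versions:

```python
# Rough shape of a Ragas evaluation run over the golden dataset (single row shown).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision

golden = Dataset.from_dict({
    "question": ["What is the main cause of climate change?"],
    "answer": ["Greenhouse gas emissions from burning fossil fuels [chunk-1a2b3c4d]."],
    "contexts": [["Burning fossil fuels releases greenhouse gases such as CO2..."]],
    "ground_truth": ["Greenhouse gas emissions, primarily from burning fossil fuels."],
})

result = evaluate(golden, metrics=[faithfulness, context_precision])
print(result)  # e.g. {'faithfulness': 0.91, 'context_precision': 0.80}
```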

I maintain a golden dataset of 20 hand-verified Q&A pairs. On every PR to main, GitHub Actions:

  1. Starts ChromaDB
  2. Ingests demo documents
  3. Runs all 20 questions through the full pipeline
  4. Scores faithfulness and context precision@5 via Ragas
  5. Fails the workflow if faithfulness < 0.85 or context_precision@5 < 0.70

Metric regressions cannot merge. You find out in CI before any code ships, not after a deploy.

The check script is a standalone Python file (scripts/check_quality_gate.py) that exits 1 if thresholds aren't met — easy to wire into any CI system.
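
The gate itself is just threshold checks plus a non-zero exit code. A sketch in the same spirit (the input format here is my assumption, not copied from the repo):

```python
# Quality gate in the spirit of scripts/check_quality_gate.py. Assumes the evaluation
# step wrote Ragas scores to JSON, e.g. {"faithfulness": 0.91, "context_precision": 0.78};
# the real script's format may differ.
import json
import sys

THRESHOLDS = {"faithfulness": 0.85, "context_precision": 0.70}

def main(results_path: str) -> None:
    with open(results_path) as f:
        scores = json.load(f)
    failed = {m: scores.get(m, 0.0) for m, t in THRESHOLDS.items() if scores.get(m, 0.0) < t}
    if failed:
        print(f"Quality gate FAILED: {failed} (thresholds: {THRESHOLDS})")
        sys.exit(1)  # non-zero exit fails the GitHub Actions job
    print(f"Quality gate passed: {scores}")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "eval_results.json")
```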


## What I Learned

1. Chunk size is not arbitrary.
I tested 500, 700, and 1000 characters. At 500, long paragraphs split mid-sentence and the re-ranker couldn't reconstruct context. At 1000, chunks were too long for the LLM to synthesize cleanly. 700 with 100 overlap hit the right balance for the climate domain documents I was using. This number will be different for your corpus — test it.
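
If you want to rerun that experiment, a naive fixed-size chunker is enough to start with (the repo's actual splitter may handle sentence boundaries more carefully):

```python
# Fixed-size character chunker with overlap, for trying out 500/700/1000 settings.
def chunk_text(text: str, size: int = 700, overlap: int = 100) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```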

2. The golden dataset quality matters more than its size.
My first golden dataset had subjective Q&A pairs — "What is an important source of emissions?" Any answer could be justified. I rebuilt it with binary-verifiable claims: exact figures, named entities, specific relationships. Ragas faithfulness scoring only means something if the ground truth is unambiguous.

3. Citation format is load-bearing.
The LLM initially produced (chunk-042) and chunk_1a2b3c4d — close but not matching the regex. The fix was putting the exact format string in the system prompt with an explicit example, not just a description. Format specification in prompts must be concrete.
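
Something like this is what finally worked: the format spelled out with a literal example (the wording here is illustrative, not the repo's prompt verbatim):

```python
# Citation-format rules embedded in the system prompt, with a concrete example.
CITATION_FORMAT_RULES = """\
Cite sources inline using the exact format [chunk-xxxxxxxx], where xxxxxxxx is the
8-character hex ID shown with each context chunk.
Example: Global warming is primarily driven by greenhouse gas emissions [chunk-1a2b3c4d].
Do not write (chunk-...), chunk_..., or any other variation.
"""
```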

4. BM25 re-index latency is real.
Rebuilding the BM25 index on every query adds latency proportional to corpus size. For 500 chunks it's ~5ms. For 50,000 chunks it becomes a problem. The current design is correct for a portfolio-scale corpus; at production scale you'd maintain a persistent BM25 index and update it incrementally on ingest.

5. Prompt versioning changes how you iterate.
Moving prompts to prompts/rag_prompts.yaml with a version field meant I could iterate on prompt content without touching Python code and track what changed in git diffs. It also let me hot-reload prompts at startup without redeploying. Small architectural decision, large practical impact.
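
A minimal version of that setup, assuming a YAML file with a top-level version field and named prompt strings (the repo's schema may differ):

```python
# Load versioned prompts from YAML at startup.
import yaml

def load_prompts(path: str = "prompts/rag_prompts.yaml") -> dict:
    with open(path) as f:
        prompts = yaml.safe_load(f)
    print(f"Loaded prompt set version {prompts.get('version')}")
    return prompts
```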


## Limitations

Synchronous BM25 rebuild. Rebuilds from ChromaDB on every query. Fast for small corpora, problematic at scale. A persistent index with delta updates would fix this.

Single collection. All documents share one ChromaDB collection (rag_documents). There's no namespace isolation between document sets. If you ingest documents for two different topics, retrieval can bleed across domains.

Golden dataset is climate-domain only. The evaluation is tuned for the demo documents. Ragas metrics are meaningful only when the golden dataset matches your actual document domain.

No streaming. POST /query waits for the full answer before returning. For long answers, this adds perceived latency. FastAPI supports streaming responses via StreamingResponse — not implemented here.

Citation enforcement catches fabricated IDs, not wrong facts. If the LLM correctly cites a real chunk but misrepresents what it says, that passes citation enforcement. Ragas faithfulness catches this, but at evaluation time, not at runtime.


## Try It

GitHub: https://github.com/ThinkWithOps/01-rag-from-scratch
Demo: https://youtu.be/wRZpmzIexnQ

```bash
git clone https://github.com/ThinkWithOps/01-rag-from-scratch.git
cd 01-rag-from-scratch
cp .env.example .env
# Add OPENAI_API_KEY and COHERE_API_KEY

docker compose up -d
bash scripts/ingest_demo_docs.sh

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main cause of climate change?", "top_k": 5}'
```

Run the full evaluation:

```bash
bash scripts/run_evaluation.sh
```

What's your quality bar for RAG before you'd ship it to users? Drop it in the comments.

