vigneshwar

Posted on May 19 • Originally published at github.com

I Spent 6 Months Fixing RAG. Here's What I Found (And Built)

#llm #ai #python #machinelearning

This is the story of a debugging session that turned into a research paper.

The Bug That Started Everything
I was building a document Q&A system — nothing exotic. Standard RAG setup. FAISS index, SBERT embeddings, GPT as the reader. Classic.

It worked fine on simple questions. "What is the refund policy?" → correct answer.

Then I tested it on a multi-hop question: "What are the environmental compliance requirements for facilities that process the chemicals used in the manufacturing process described in section 4.2?"

The model's answer was confident. Detailed. And completely wrong.

The retrieved documents were all there. Every piece of information needed to answer correctly was in the context window. But the model still hallucinated a number that appeared nowhere in any document.

I started logging. What I found was two distinct failure modes happening simultaneously:

Failure Mode 1: Semantic Drift. By the time the query had been reformulated for multi-hop retrieval, the embedding had drifted so far from the original intent that we were retrieving the wrong documents. Not slightly wrong. Documents from a completely different section of the corpus.

Failure Mode 2: Context Poisoning. Even when we retrieved mostly correct documents, 1–2 tangentially related but contradictory chunks were slipping through. And those poison chunks were enough to derail the model.

Standard RAG has no defense against either of these. The pipeline is essentially: embed → retrieve → stuff into context → hope.

I needed something better.

Six Months Later: VORTEXRAG
I'm releasing the full framework today. 7 layers, each targeting a specific failure mode. Here's what I built and why each layer exists.

Layer 1: Tri-Vector Encoding (TVE)
The problem: Single-vector embeddings collapse too much information. "The bank charged a fee" and "The river bank was steep" can land close together in SBERT space even though they are unrelated in most retrieval contexts.

The solution: Three encoding arms:

# Semantic arm: standard SBERT, 768-d
s = sbert_model.encode(chunk)      # shape: (768,)

# Syntactic arm: POS + dependency structure
t = syntactic_encoder(chunk)       # shape: (64,)

# Causal arm: verb-argument chains
c = causal_encoder(chunk)          # shape: (32,)

# Fused vector
v = concat([α·s, β·t, γ·c])        # shape: (864,)
The causal arm is the key innovation. It captures "X causes Y" relationships that pure semantic similarity misses entirely. This is what allows the pipeline to distinguish between a document that mentions a concept and a document that explains the causal mechanism behind it.
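
Here's a rough sketch of the fusion step, assuming NumPy and random stand-ins for the three encoder outputs. The function name fuse_tri_vector, the default weights, and the per-arm L2 normalisation are my illustrative assumptions, not VORTEXRAG's actual API or values.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale v to unit length; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm

def fuse_tri_vector(s, t, c, alpha=1.0, beta=0.5, gamma=0.5):
    """Concatenate the weighted arms into one vector.

    alpha/beta/gamma are hypothetical defaults. Each arm is normalised
    first (an assumption) so that the 768-d arm does not dominate purely
    because of its size.
    """
    return np.concatenate([
        alpha * l2_normalize(np.asarray(s, dtype=float)),
        beta * l2_normalize(np.asarray(t, dtype=float)),
        gamma * l2_normalize(np.asarray(c, dtype=float)),
    ])

# Random stand-ins for the real encoder outputs.
rng = np.random.default_rng(0)
s = rng.normal(size=768)  # semantic arm
t = rng.normal(size=64)   # syntactic arm
c = rng.normal(size=32)   # causal arm

v = fuse_tri_vector(s, t, c)
print(v.shape)  # (864,)
```

With normalised arms, the weights directly control how much each arm contributes to a dot-product or cosine comparison between two fused vectors.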

Layer 2: Vortex Retrieval Cone (VRC)
The problem: Flat cosine similarity treats all high-similarity documents equally. But document #1 and document #47 in your ranked list shouldn't have equal weight — there's a natural falloff in relevance.

The solution: Spiral ranking inspired by vortex dynamics:

spiral_rank = TVE · e^(−λr) · cos(nθ)
Where r is the radial distance (rank position) and θ is an angular phase that encodes causal depth. Documents with high causal relevance get a phase advantage that can overcome a slightly lower semantic similarity score.

In practice: the top-k documents returned by VRC are causally denser than those returned by a flat cosine search on the same index.
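
To make the formula concrete, here's a small sketch of the scoring, assuming NumPy. How θ is derived from causal depth isn't spelled out above, so the theta values here are hand-picked, and lam and n are hypothetical defaults.

```python
import numpy as np

def spiral_rank(sim, theta, lam=0.1, n=1):
    """score_i = sim_i * exp(-lam * r_i) * cos(n * theta_i).

    sim   : similarity of each candidate to the query
    theta : angular phase per candidate (0 = most causally relevant);
            how it is computed is an assumption left to the caller
    r_i   : 0-based position of candidate i in the flat similarity ranking
    """
    sim = np.asarray(sim, dtype=float)
    theta = np.asarray(theta, dtype=float)
    order = np.argsort(-sim)            # candidate indices, best similarity first
    r = np.empty_like(order)
    r[order] = np.arange(len(sim))      # r[i] = rank position of candidate i
    return sim * np.exp(-lam * r) * np.cos(n * theta)

sim = [0.82, 0.80, 0.60]    # candidate 0 wins on flat cosine similarity
theta = [1.2, 0.0, 0.3]     # candidate 1 has the best causal phase
scores = spiral_rank(sim, theta)
reranked = list(np.argsort(-scores))
print(reranked)
```

In this toy example, candidate 1 overtakes candidate 0 despite a slightly lower similarity, which is the phase advantage described above.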

Layer 3: Semantic Drift Corrector (SDC)
The problem: Multi-hop queries reformulate themselves at each hop. Each reformulation can drift slightly from the original intent. Over 3–4 hops, this compounds into a completely different query.

The solution: Track the embedding trajectory. At each hop:

drift = query_embedding - anchor_embedding
SDS = 1 - tanh(np.linalg.norm(drift) / τ)

if SDS < 0.72:
    query_embedding = re_anchor(query_embedding, anchor_embedding)
The SDS score (Semantic Drift Score) measures how far we've drifted. Below 0.72, we re-anchor to the original query intent. This single intervention eliminated most of our multi-hop hallucinations.
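
A runnable sketch of the drift check, assuming NumPy and toy 3-d embeddings. The re_anchor shown here is a hypothetical stand-in (a linear pull back toward the anchor), and tau=1.0 and pull=0.5 are illustrative values, not the framework's defaults.

```python
import numpy as np

def semantic_drift_score(query_emb, anchor_emb, tau=1.0):
    """SDS = 1 - tanh(||query - anchor|| / tau); 1.0 means no drift."""
    drift = query_emb - anchor_emb
    return 1.0 - np.tanh(np.linalg.norm(drift) / tau)

def re_anchor(query_emb, anchor_emb, pull=0.5):
    """Hypothetical re-anchoring: interpolate back toward the anchor."""
    return (1.0 - pull) * query_emb + pull * anchor_emb

anchor = np.array([1.0, 0.0, 0.0])   # embedding of the original query
hop = np.array([0.6, 0.5, 0.2])      # embedding after a reformulation

sds_before = semantic_drift_score(hop, anchor)
if sds_before < 0.72:                # threshold from the text
    hop = re_anchor(hop, anchor)
sds_after = semantic_drift_score(hop, anchor)

print(round(sds_before, 3), round(sds_after, 3))
```

Note that a single pull does not necessarily bring SDS back above the threshold; whether to re-anchor repeatedly or reset to the anchor outright is a design choice this sketch leaves open.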

Layer 4: Context Poison Guard (CPG)
The problem: 1–2 contradictory or off-topic chunks in the context window are enough to poison the model's answer. We needed a way to identify and remove these before they reach the LLM.

The solution: Entity-salience ratio per chunk:

ESR = sum(SDS_i * w_i for each entity i) / (num_propositions + ε)

if ESR < 3.5:
    flag_for_purging(chunk)
The purging algorithm is provably greedy-optimal (I include the formal proof in the paper). It maximizes total ESR across the retained context while respecting the token budget.

In ablation studies, removing CPG alone dropped faithfulness from 0.94 to 0.75, a 0.19-point drop from a single layer.
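
Here's one plausible shape for the purge step, as a plain-Python sketch. The Chunk fields, the entity weights, and the "highest ESR first" greedy loop are my assumptions for illustration; the paper's proven algorithm may differ in its details.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    entity_scores: list      # (SDS_i, w_i) pairs, one per entity (hypothetical layout)
    num_propositions: int
    tokens: int

def esr(chunk: Chunk, eps: float = 1e-6) -> float:
    """Entity-salience ratio: sum(SDS_i * w_i) / (num_propositions + eps)."""
    total = sum(sds * w for sds, w in chunk.entity_scores)
    return total / (chunk.num_propositions + eps)

def purge(chunks, threshold=3.5, token_budget=1000):
    """Drop chunks below the ESR threshold, then greedily keep the
    highest-ESR survivors that still fit in the token budget."""
    survivors = [c for c in chunks if esr(c) >= threshold]
    kept, used = [], 0
    for c in sorted(survivors, key=esr, reverse=True):
        if used + c.tokens <= token_budget:
            kept.append(c)
            used += c.tokens
    return kept

chunks = [
    Chunk("A", [(0.9, 5.0), (0.8, 4.0)], num_propositions=1, tokens=400),
    Chunk("B", [(0.9, 2.0)], num_propositions=2, tokens=300),   # low ESR
    Chunk("C", [(0.7, 6.0), (0.6, 5.0)], num_propositions=2, tokens=500),
]
kept = purge(chunks)
print([c.text for c in kept])
```

Chunk B has an ESR of about 0.9 and gets flagged, while A and C clear the 3.5 threshold and both fit within the budget.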

Layer 5: Rank Fusion Gate (RFG)
The problem: Most rank fusion methods are additive. A single terrible signal gets diluted but doesn't eliminate a bad document.

The solution: Multiplicative fusion:

Φ = TVE^α × SDS^β × ESR^γ
If any of the three signals is near zero, Φ collapses toward zero. For example, with unit exponents and the remaining signal at 1.0, a document that scores 0.9 on semantic similarity but 0.1 on its (normalized) ESR gets Φ ≈ 0.09, which effectively eliminates it.

This was a deliberate design choice. In high-stakes retrieval (medical, legal, financial), you want a veto mechanism, not a popularity contest.
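
The veto effect is easy to see side by side with an additive baseline. This sketch assumes all three signals are normalized to [0, 1] and uses unit exponents; the function names and the plain-mean baseline are illustrative.

```python
def multiplicative_fusion(tve, sds, esr, alpha=1.0, beta=1.0, gamma=1.0):
    """Phi = TVE^alpha * SDS^beta * ESR^gamma (signals assumed in [0, 1])."""
    return (tve ** alpha) * (sds ** beta) * (esr ** gamma)

def additive_fusion(tve, sds, esr):
    """Plain mean of the three signals, used here only as a baseline."""
    return (tve + sds + esr) / 3.0

good = (0.9, 1.0, 0.9)      # strong on every signal
vetoed = (0.9, 1.0, 0.1)    # same similarity, but a very weak ESR signal

phi_good = multiplicative_fusion(*good)
phi_vetoed = multiplicative_fusion(*vetoed)
mean_vetoed = additive_fusion(*vetoed)

print(round(phi_good, 2), round(phi_vetoed, 2), round(mean_vetoed, 2))
```

The weak document still averages about 0.67 under the additive baseline, but its multiplicative score drops to about 0.09.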

Layer 6: Causal Context Builder (CCB)
The problem: LLMs have dramatically higher attention to the beginning and end of their context window. Documents buried in the middle get "lost" — the Lost-in-the-Middle problem (Liu et al., 2023).

The solution: Reorder chunks by causal depth:

position = rank / causal_depth

# High causal depth → lower position number → placed at context start

Chunks that are causally central to answering the question get placed at the beginning of the context window, where attention weights are highest. Peripheral context gets pushed to the middle where it does less damage if ignored.
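
A small sketch of the reordering, in plain Python. It uses rank / causal_depth as the sort key so that higher causal depth yields a smaller position number, matching the behaviour described above. The edges_first helper is my own assumption about how peripheral chunks could end up in the middle; the tuple layout is illustrative.

```python
def order_by_causal_depth(chunks):
    """chunks: list of (text, rank, causal_depth) with 1-based rank, depth > 0.

    A smaller key means earlier in the context window.
    """
    return sorted(chunks, key=lambda ch: ch[1] / ch[2])

def edges_first(ordered):
    """Alternate chunks between the front and the back of the context,
    so the weakest ones land in the middle (hypothetical placement)."""
    front, back = [], []
    for i, ch in enumerate(ordered):
        (front if i % 2 == 0 else back).append(ch)
    return front + back[::-1]

chunks = [
    ("a", 1, 1.0),
    ("b", 2, 4.0),   # lower retrieval rank, but causally central
    ("c", 3, 1.0),
    ("d", 4, 2.0),
]
ordered = order_by_causal_depth(chunks)
layout = edges_first(ordered)
print([ch[0] for ch in ordered], [ch[0] for ch in layout])
```

Chunk "b" moves to the start of the context because of its causal depth, and in the edges-first layout the two weakest chunks sit in the middle.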

Layer 7: Faithfulness Verifier (FV)
The problem: Even with all 6 preceding layers, some hallucinations slip through. We need a final gate.

The solution: Score the candidate answer against the retrieved context:

ΔR = 1 - (ROUGE_L * NLI_entailment_score)

if ΔR > 0.15:
    regenerate_answer()
If the answer diverges more than 15% from the source documents (measured by ROUGE-L weighted by NLI entailment), it gets thrown out and regenerated. This catches the subtle hallucinations — cases where the model paraphrases correctly but changes a critical number or name.
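
Here's a dependency-free sketch of the check. ROUGE-L is computed as an LCS-based F1 over whitespace tokens, and the NLI entailment score is passed in as a plain float, since the choice of NLI model isn't specified above. The function names and example sentences are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def delta_r(answer: str, context: str, nli_entailment: float) -> float:
    """Delta-R = 1 - (ROUGE_L * NLI entailment score)."""
    return 1.0 - rouge_l_f1(answer, context) * nli_entailment

def needs_regeneration(answer, context, nli_entailment, threshold=0.15):
    return delta_r(answer, context, nli_entailment) > threshold

context = "the limit is 40 mg per litre"
faithful = needs_regeneration("the limit is 40 mg per litre", context, 0.95)
altered = needs_regeneration("the limit is 90 mg per litre", context, 0.10)
print(faithful, altered)
```

The altered answer overlaps heavily with the source on ROUGE-L alone, so it is the low entailment score (an assumed value here) that pushes ΔR over the threshold.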

Results
Tested on 4 standard benchmarks (NQ, HotpotQA, MuSiQue, 2WikiMultiHopQA):

| System | EM | F1 | Faithfulness |
|---|---|---|---|
| Naive RAG | 61.2 | 68.4 | 0.71 |
| HyDE | 63.8 | 71.2 | 0.74 |
| Self-RAG | 65.4 | 73.1 | 0.79 |
| FLARE | 64.9 | 72.8 | 0.77 |
| VORTEXRAG | 74.8 | 82.6 | 0.94 |

+13.6 EM. +14.2 F1. +0.23 Faithfulness over naive RAG.

The ablation shows all 7 layers contribute. The two biggest individual contributors are CPG (+0.19 faithfulness) and SDC (+0.08 EM on multi-hop benchmarks specifically).

Quick Start

pip install vortexrag

from vortexrag import VortexRAG, VortexConfig

config = VortexConfig(
    sdc_threshold=0.72,
    cpg_esr_threshold=3.5,
    fv_delta_r_threshold=0.15,
)

rag = VortexRAG(config)
rag.index(your_documents)

answer = rag.query("Your complex multi-hop question here")

What's Next
The framework is MIT licensed. The full research paper (with formal proofs) is on Zenodo.

If you're building RAG systems and hitting hallucination walls — especially on multi-hop or domain-specific queries — this framework is designed for exactly that problem.

GitHub: https://github.com/vignesh2027/VORTEXRAG

Paper: https://doi.org/10.5281/zenodo.20285144

Live demo docs: https://vignesh2027.github.io/VORTEXRAG

Questions? Drop them in the comments — happy to go deep on any of the layers.