Why Traditional RAG Scores 0% on Multi-Hop Queries — and What Two Lines of Code Changed

Emil — Mon, 13 Apr 2026 18:57:14 +0000

## The Problem Nobody Talks About

Ask your RAG system: "What award did the director of Inception win?"

This requires two hops:

1. Inception → Christopher Nolan
2. Christopher Nolan → Academy Award

Your retrieval engine does hop 1 fine. But hop 2? The embedding of the original query is nowhere near "Academy Award" in vector space. The answer sits at rank 665. Your top-20 retrieval window never sees it.

We tested this systematically on HotpotQA fullwiki — 5.2M Wikipedia articles, 500 multi-hop questions.

Every traditional method scored 0% Hit@20. BM25. Dense retrieval. Rerankers. All of them.
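For readers unfamiliar with the metric, here is a minimal sketch of Hit@k. It assumes a "hit" means at least one gold document appears in the top k results; the function names and toy document IDs are illustrative and not taken from the repo.

```python
def hit_at_k(ranked_ids, gold_ids, k=20):
    """Return 1.0 if any gold document appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in gold_ids for doc_id in ranked_ids[:k]) else 0.0


def mean_hit_at_k(runs, k=20):
    """Average Hit@k over a list of (ranked_ids, gold_ids) pairs."""
    return sum(hit_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Toy ranking: the second-hop document sits at rank 665, outside the top 20.
ranked = [f"doc_{i}" for i in range(1, 1001)]
print(hit_at_k(ranked, {"doc_665"}, k=20))   # 0.0
print(hit_at_k(ranked, {"doc_665"}, k=700))  # 1.0
```

A document at rank 665 only counts once the window is widened far beyond what a RAG prompt can hold, which is why a 0% Hit@20 is possible even when the answer is technically in the index.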

## What If the Query Could Change Shape?

In 1958, Daniel Koshland proposed the induced-fit model of enzyme binding. Unlike the rigid "lock and key" model, enzymes change their shape to fit the substrate.

We applied the same principle to retrieval.

At each hop, Induced-Fit Retrieval (IFR) mutates the query embedding based on what it just found. The query literally reshapes itself to reach the next piece of evidence.

Query → [hop 1: find Film X] → mutate → [hop 2: find director] → mutate → [hop 3: find award] → found
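To make the idea concrete, here is a toy sketch of a query moving toward evidence it could not reach directly. The mutation rule (a linear step toward the visited node), the `rate` parameter, and the 3-d embeddings are all assumptions for illustration; this post does not spell out the exact update IFR uses.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mutate(query, node_vec, rate=0.5):
    """Assumed mutation rule: move the query `rate` of the way toward the visited node."""
    return [(1 - rate) * q + rate * n for q, n in zip(query, node_vec)]


# Invented 3-d embeddings; the axes loosely mean (film, person, award).
inception = [1.0, 0.0, 0.0]
nolan = [0.3, 0.9, 0.1]
academy_award = [0.0, 0.4, 0.9]

query = [1.0, 0.2, 0.0]  # the original question sits close to the film
before = cosine_sim(query, academy_award)

# Follow the hop chain and reshape the query after each visited node.
for visited in (inception, nolan):
    query = mutate(query, visited)
after = cosine_sim(query, academy_award)

print(f"similarity to target before: {before:.3f}, after two hops: {after:.3f}")
```

In this toy setup the static query is almost orthogonal to the award document, while the mutated query is noticeably closer to it after passing through the director node.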

## The Drift Problem

This sounds elegant on paper. In practice, v1 was a disaster.

67% of failures came from catastrophic drift — the query mutated so aggressively that by hop 3, it had lost >80% of its original meaning. It was finding documents, but completely wrong ones.
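One way to watch this happen is to track similarity to the original query after every hop. The sketch below assumes "meaning" is proxied by cosine similarity and uses a simple linear mutation rule; the post does not say how the 80% figure was measured, so treat both as illustrative choices.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drift_trace(original, visited_vecs, rate):
    """Similarity to the original query after each unanchored mutation step."""
    query, trace = list(original), []
    for vec in visited_vecs:
        # Assumed mutation rule: move `rate` of the way toward the visited node.
        query = [(1 - rate) * q + rate * v for q, v in zip(query, vec)]
        trace.append(cosine_sim(query, original))
    return trace


original = [1.0, 0.0, 0.0, 0.0]
# Three visited nodes, each orthogonal to the original query (a worst case).
visited = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
trace = drift_trace(original, visited, rate=0.6)
print([round(s, 2) for s in trace])  # similarity falls at every hop
```

With nothing pulling the query back, similarity to the original drops at each hop and ends below 0.2 by hop 3 in this worst-case toy.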

We tested 8 drift correction approaches:

- PID controllers
- Sentinel beams
- Moving anchors
- Drifting anchors
- Threshold tuning
- Hierarchical traversal
- Attention-based edge weighting
- Swarm coordination (Boids)

Most made things worse. The winner was embarrassingly simple:

```python
# Blend 50% of original query at every hop
query_vector = 0.5 * mutated + 0.5 * original

# Hard reset if drift exceeds threshold
if cosine_sim(query_vector, original) < 0.5:
    query_vector = original
```

Two lines of code. nDCG went from 0.197 to 0.317 (+61%).
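The snippet above leaves `cosine_sim` undefined and assumes array-style arithmetic. Here is a self-contained sketch of the same blend-and-reset step using plain Python lists. The function name `anchor` and its parameter names are ours for illustration, not identifiers from the repo.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def anchor(mutated, original, anchor_weight=0.5, reset_below=0.5):
    """Blend the mutated query with the original; hard-reset on excessive drift."""
    query = [(1 - anchor_weight) * m + anchor_weight * o
             for m, o in zip(mutated, original)]
    if cosine_sim(query, original) < reset_below:
        query = list(original)  # drifted too far: fall back to the original query
    return query


original = [1.0, 0.0, 0.0]
blended = anchor([0.1, 0.9, 0.0], original)   # mild drift: blend is kept
reset = anchor([-1.0, 0.5, 0.0], original)    # severe drift: reset fires
print(blended, reset)
```

A side note on the threshold: if both vectors are unit length, a 50/50 blend has cosine similarity sqrt((1 + c) / 2) to the original, where c is the similarity between the mutated and original vectors. Under that assumption the hard reset only fires when c < -0.5, so the blend does most of the work and the reset acts as a safety net.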

## Benchmark Results

Tested on HotpotQA fullwiki: 5.2M Wikipedia articles, 500 questions, 3 random seeds, single RTX 3060.

| Method | R@5 | R@10 | MRR |
|---|---|---|---|
| RAG-rerank baseline | 0.337 | 0.337 | 0.548 |
| IFR-hybrid+CE | 0.366 | 0.366 | 0.554 |
| Delta (absolute) | +2.9 points (p=0.0002) | +2.9 points | +0.6 points |

R@5 equals R@10 for both methods because every target that is retrieved at all already appears within the top 5; ranks 6–10 add no new hits at this difficulty level.
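For reference, a minimal sketch of how R@k and MRR can be computed per question. The definitions are assumptions, since the post does not state them: recall as the fraction of gold documents in the top k, and reciprocal rank based on the first gold document. The toy IDs are invented.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold documents found in the top-k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


ranked = ["d7", "d2", "d9", "d4", "d1", "d3", "d8", "d5", "d6", "d0"]
gold = {"d2", "d42"}  # d42 is never retrieved, at any depth shown here

r5 = recall_at_k(ranked, gold, 5)
r10 = recall_at_k(ranked, gold, 10)
rr = reciprocal_rank(ranked, gold)
print(r5, r10, rr)  # 0.5 0.5 0.5
```

The toy case shows how R@5 and R@10 can coincide: the one gold document that is found lands in the top 5, and the other never enters the top 10.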

Scaling: near-constant latency, which we describe as O(1). 100x data growth produced only 1.1x latency growth, and beam traversal takes ~10ms on the full 5.2M corpus.

## Why Three Layers Beat Perfect Traversal

Raw beam search R@5 = 0.309. With cross-encoder reranking: 0.366 (+5.7 points).

The insight: drift noise scores high against the mutated query but low against the original. Because the cross-encoder scores each candidate against the original question, it naturally filters that noise out. Trying to eliminate drift at the beam level gives diminishing returns. The multi-layer pipeline is the actual solution.
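A minimal sketch of that final layer, assuming the reranker is any callable that scores a (question, document) pair. The `overlap_score` function below is a deliberately crude stand-in so the example runs without a model. It is not a cross-encoder, and the candidate texts are invented.

```python
def rerank(question, candidates, score_fn, top_k=5):
    """Sort beam candidates by relevance to the *original* question."""
    scored = sorted(candidates, key=lambda doc: score_fn(question, doc), reverse=True)
    return scored[:top_k]


def overlap_score(question, doc):
    """Stand-in scorer (NOT a cross-encoder): fraction of question tokens found in doc."""
    q_tokens = set(question.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)


question = "what award did the director of inception win"
beam_candidates = [
    "hans zimmer composed the soundtrack",  # drift noise picked up mid-traversal
    "christopher nolan is the director of inception",
    "nolan did win an academy award for best director",
]
top = rerank(question, beam_candidates, overlap_score, top_k=2)
print(top)
```

The point of the design is that candidates are gathered with the mutated query but judged against the original one, so a document that only matched because of drift falls to the bottom.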

## Limitations (We're Honest)

- The 50% blend ratio is empirical. We don't have a principled method for setting it.
- Tested only on HotpotQA fullwiki. Other multi-hop benchmarks needed.
- Single GPU (RTX 3060). Not benchmarked at enterprise scale.

---

Question for the community:
We fixed drift with a static 50% anchor blend — but this feels like a brute-force solution. Has anyone worked on adaptive blending that adjusts the anchor weight based on query complexity or hop distance? Curious what approaches you've tried.

https://github.com/emil-celestix/celestix-ifr

Beyond Static RAG: Using 1958 Biochemistry to Beat Multi-Hop Retrieval by 14%

Emil — Wed, 01 Apr 2026 04:16:13 +0000

Standard Retrieval-Augmented Generation (RAG) often falls short on complex, multi-hop questions because it relies on static "lock and key" query matching. If the information needed to answer a query is semantically distant from the original text, standard vector search simply won't find it.

We've developed Induced-Fit Retrieval (IFR), a dynamic graph traversal approach that mutates the query vector at every step to discover semantically distant but logically connected information.

## The Core Results
We ran our prototype through a rigorous test suite of 30 queries across multiple graph sizes, up to 5.2 million atoms.

- 14.3% higher nDCG@10 compared to a competitive RAG-rerank baseline.
- 15% Multi-hop Hit@20 in scenarios where traditional RAG methods scored 0%.
- O(1) Latency Scaling: Latency remains near 10ms whether searching 100 atoms or 5.2 million.
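For context on the first figure, here is a minimal sketch of nDCG@k over a ranked list of relevance labels. It makes a simplifying assumption: the ideal ordering is computed from the retrieved list itself, not from every relevant document in the corpus. The helper names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0


perfect = ndcg_at_k([1, 1, 0, 0])   # relevant documents already on top
shuffled = ndcg_at_k([0, 1, 0, 1])  # same documents, pushed down the list
print(perfect, round(shuffled, 3))
```

Because of the logarithmic discount, moving a relevant document from rank 4 to rank 1 matters far more than moving it from rank 10 to rank 7, which is why reranking shows up clearly in this metric.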

## Why Biochemistry?
The system is inspired by Daniel Koshland’s 1958 "induced fit" model. In biology, enzymes change shape upon encountering a substrate to improve binding.

IFR applies this to Information Retrieval: instead of a static query vector, the vector mutates at each hop based on the visited node's embedding. This allows the query to follow the "curved manifolds" of high-dimensional embedding space that a fixed vector cannot reach.

## Lessons from the Data
Transparency is key to research, so we are also sharing our failures:

Catastrophic Drift: 67% of our failures occurred because the query mutated too aggressively, losing its original intent.

The Solution: v2 will implement an "Alpha Floor" to preserve at least 50% of the original query signal at all times.
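One possible reading of the "Alpha Floor", sketched under assumptions: `alpha` is the weight given to the original query when mixing it with the mutated one, and the floor clamps that weight so it never drops below 0.5. The function name and signature are hypothetical, and mixing weight is only a proxy for "signal preserved".

```python
def apply_alpha_floor(original, mutated, alpha, floor=0.5):
    """Mix the two queries, never letting the original's weight drop below `floor`."""
    alpha = max(floor, min(1.0, alpha))  # clamp the requested weight into [floor, 1.0]
    return [alpha * o + (1 - alpha) * m for o, m in zip(original, mutated)]


original = [1.0, 0.0]
mutated = [0.0, 1.0]

# An aggressive mutation step asks for only 10% original; the floor overrides it.
floored = apply_alpha_floor(original, mutated, alpha=0.1)
# A conservative step asking for 80% original passes through unchanged.
kept = apply_alpha_floor(original, mutated, alpha=0.8)
print(floored, kept)
```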

We have open-sourced the prototype, our 18 raw JSON result logs, ablation studies, and full technical reports.

Check out the repo on GitHub:
https://github.com/emil-celestix/celestix-ifr