Yogesh23012001

Posted on Jun 28

Advanced RAG Techniques Aren't Better. They're Better Sometimes.

#webdev #programming #rag #ai

I added five retrieval techniques to a RAG pipeline and measured each one. The most useful result was the technique that backfired.

The technique that made retrieval worse

I expected HyDE to improve retrieval.
On one query, it took context precision from 0 to 0.80 — surfacing a chunk my baseline had missed entirely, and ranking it first. A clean win.

On another query, it did the opposite. Recall collapsed from 0.80 to 0.17. The same technique didn't just fail to help — it actively dragged the right chunks out of the results.

Same pipeline. Same technique. Opposite outcomes.

That's the thing nobody tells you in the blog posts titled "Add HyDE to boost your RAG." HyDE isn't better. It's better sometimes, and knowing which times is the actual skill. Every "advanced RAG technique" I added this week turned out to be exactly this: a tool with a tradeoff, not a free upgrade. The work was never adding them. The work was measuring which one earned its complexity, on which query, and why.

This is a post about that measurement.

What I built (and the only question that mattered)

I built a RAG pipeline over Anthropic's own documentation — 15 doc pages, 667 chunks, Postgres + pgvector with an HNSW index — then bolted on five retrieval modes I could switch between and compare head-to-head:

Dense — plain vector similarity (bge-small embeddings).
Hybrid — dense + BM25 keyword search, fused with Reciprocal Rank Fusion.
Reranking — pull a broad candidate set, re-score with a cross-encoder.
HyDE — embed a hypothetical answer instead of the question.
Contextual retrieval — prepend an LLM-written, document-aware summary to each chunk before embedding.

I came at this as a backend engineer, not an ML researcher. And the backend engineer's question about any of these isn't "is it state of the art?" It's the same question you'd ask about a cache layer or a message queue: which of these actually earns its complexity? A RAG system is a distributed system with a model attached. Every component you add is something you have to operate, debug, and pay for. So — which ones pay you back?

Measure the boring baseline first

Before measuring anything fancy, I measured plain dense retrieval over a 28-question eval set:

Faithfulness: 0.96
Context precision: 0.60

Those two numbers point at two completely different problems, and conflating them is the most common RAG mistake I see. RAG fails in two independent places:

Retrieval: did you fetch the right chunks? (context precision / recall)
Generation: did the model answer faithfully from what you fetched? (faithfulness)

My faithfulness was already 0.96. The generator was not the problem: given good context, the model grounded its answer fine. The weak spot was precision at 0.60: roughly 40% of what I was feeding the model was noise. The right chunk was usually in the top-k, just buried in strays.
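To make the retrieval half of that split concrete, here is a minimal sketch of how the two retrieval metrics can be computed once each retrieved chunk has a relevance judgment. It assumes one common rank-aware definition of context precision (mean of precision@k over the ranks that hold a relevant chunk) and a simplified, ID-based recall. The function names are my own for illustration, and this is not necessarily how Ragas or the harness in this post computes them.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-aware precision over a ranked list of retrieved chunks.

    relevance[i] is True when the chunk at rank i+1 was judged relevant.
    Precision@rank is accumulated only at ranks holding a relevant chunk,
    so a relevant chunk buried under noise scores lower than one ranked first.
    """
    hits = 0
    running = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            running += hits / rank
    return running / hits if hits else 0.0


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of known-relevant chunk IDs that made it into the results.

    Simplification: assumes ground-truth chunk IDs exist for each question.
    """
    if not relevant_ids:
        return 0.0
    return len(relevant_ids & set(retrieved_ids)) / len(relevant_ids)


# One relevant chunk, ranked second of two: right chunk present, but buried.
buried = context_precision([False, True])
# The same chunk ranked first.
on_top = context_precision([True, False])
```

The point of the rank-aware form is that it separates "the right chunk was retrieved" from "the right chunk was retrieved near the top", which is the distinction the 0.60 above is hiding.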

That reframes the whole project. Every advanced technique I was about to add targets retrieval — and retrieval was exactly the failing half. If faithfulness had been the low number, none of this would have helped; I'd have been tuning prompts instead. You can't know that until you split the metric and look. Look at the data first, then pick the tool.

Where each technique earned its keep — or didn't

Hybrid search (BM25 + dense). Dense retrieval has a blind spot: exact terms. Ask "what does cache_control: {"type": "ephemeral"} do?" and pure semantic similarity drifts toward vaguely-related caching prose. BM25 nailed the exact chunk dense missed entirely — but naive Reciprocal Rank Fusion then demoted it, because the dense ranker outvoted the one sparse ranker that got it right. Lesson: exact-term matching is a real, distinct failure mode, but fusion isn't free — RRF needs weighting, or it averages away the very signal you added it for.
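A minimal sketch of what weighted fusion could look like, assuming each ranker returns an ordered list of chunk IDs. The function name, the weights, and the chunk IDs are hypothetical; k=60 is the conventional RRF constant, not a value taken from this pipeline.

```python
from collections import defaultdict


def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion, one weight per ranker.

    Each ranker contributes weight / (k + rank) for every chunk it returned.
    Rankers missing from `weights` default to 1.0 (plain, unweighted RRF).
    """
    scores: dict[str, float] = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical case: only BM25 found the exact-term chunk "x".
rankings = {
    "dense": ["d1", "d2", "d3"],
    "bm25": ["x", "d2", "d1"],
}
naive = weighted_rrf(rankings, {"dense": 1.0, "bm25": 1.0})
tuned = weighted_rrf(rankings, {"dense": 1.0, "bm25": 3.0}, k=1)
```

In this toy case, naive fusion places "x" below two chunks that merely appear in both lists, while the tuned call puts it first. Note that with k=60 the per-rank contributions are nearly flat, so a modest weight on its own may not be enough; the weight and k are both knobs worth measuring.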

HyDE: the rescue and the backfire. On casually phrased, mismatched queries ("why does Claude keep forgetting what we talked about earlier?"), HyDE is magic: it writes a hypothetical answer full of the docs' actual vocabulary ("context window," "tokens," "compaction") and embeds that, landing in the cluster the question's own words could never reach. On that query, precision went from 0 to 0.80. But on queries already well matched to the corpus, that same hypothetical answer invents detail that pulls retrieval toward the wrong region: recall fell from 0.80 to 0.17. The fix isn't "use HyDE" or "don't." It's query-adaptive routing: apply HyDE only when the query and the corpus speak different languages.
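One way the routing idea could be sketched, purely as an illustration: measure how much of the query's vocabulary already appears in the corpus, and reach for HyDE only when the overlap is low. The stopword list, the 0.5 threshold, and the function names are all assumptions of mine, not something this pipeline implements; a real router would need its threshold tuned against the eval set.

```python
import re

# Tiny illustrative stopword list; a real one would be larger.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "do", "does", "did", "what", "why",
    "how", "we", "i", "you", "it", "to", "of", "in", "on", "about",
    "and", "or",
})


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word-like tokens, keeping underscores."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def vocabulary_overlap(query: str, corpus_vocab: set[str]) -> float:
    """Fraction of the query's content terms that occur in the corpus."""
    terms = [t for t in tokenize(query) if t not in STOPWORDS]
    if not terms:
        return 0.0
    return sum(t in corpus_vocab for t in terms) / len(terms)


def choose_mode(query: str, corpus_vocab: set[str], threshold: float = 0.5) -> str:
    """Return "dense" when the query already speaks the corpus's language,
    otherwise "hyde" so a hypothetical answer can bridge the vocabulary gap."""
    if vocabulary_overlap(query, corpus_vocab) >= threshold:
        return "dense"
    return "hyde"


# Hypothetical vocabulary, e.g. the terms of a BM25 index.
vocab = {"cache_control", "ephemeral", "context", "window", "tokens", "compaction"}
exact = choose_mode("what does cache_control ephemeral do", vocab)
casual = choose_mode("why does Claude keep forgetting what we talked about earlier?", vocab)
```

A lexical check like this is cheap and adds no LLM call, but it is only a heuristic; an LLM-based classifier is another option, at the cost of one more thing to operate.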

Reranking. A cross-encoder reads query and chunk together, instead of comparing two independently made vectors. With reranking, the pipeline answered a question it had declined to answer on hybrid retrieval alone: the cross-encoder pulled the supporting chunk high enough that the generator finally had what it needed. The throughline: retrieval quality directly controls faithfulness. A faithful generator is downstream of good retrieval; you cannot prompt your way out of bad context.
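The retrieve-broad-then-rescore shape can be sketched without committing to a particular model. In the snippet below the scorer is pluggable, and `toy_overlap_score` is a stand-in I made up so the example runs anywhere; in a real pipeline it would be replaced by a cross-encoder that scores each (query, chunk) pair jointly.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> list[str]:
    """Re-score a broad candidate set against the query and keep the best top_k.

    score_fn sees the query and the chunk together, which is the property
    that distinguishes a reranker from comparing two precomputed vectors.
    Ties keep the original retrieval order.
    """
    scored = [(score_fn(query, chunk), index, chunk) for index, chunk in enumerate(candidates)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:top_k]]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: share of query words present in the chunk.
    NOT a cross-encoder; it only makes the sketch runnable."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


candidates = [
    "caching overview and pricing",
    "rate limits",
    "prompt caching stores the prompt prefix",
]
top = rerank("prompt caching prefix", candidates, toy_overlap_score, top_k=2)
```

Keeping the scorer behind a plain callable also makes it easy to measure the reranker on its own: swap it out, rerun the eval, compare.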

Contextual retrieval. Anthropic's technique: an LLM writes a short per-chunk context that situates the chunk in its document, and that context is prepended before embedding. The catch is cost: one LLM call per chunk. Prompt caching is what makes it viable (cache the document once and reuse it across that document's chunks; Anthropic's published estimate is roughly $1 per million document tokens). It helped least on easy queries and most exactly where dense failed hardest: the starved, context-poor chunks. Situational, like all the rest.
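The ingest-time step can be sketched as below, under the assumption that the LLM call sits behind a callable taking the whole document and one chunk. `stub_situate` is a hypothetical placeholder that only echoes the document's first line, so the example runs without any API; the real version would be the per-chunk LLM call described above.

```python
from typing import Callable


def contextualize_chunks(
    document: str,
    chunks: list[str],
    situate: Callable[[str, str], str],
) -> list[str]:
    """Prepend a document-aware context to each chunk before embedding.

    `situate` is called once per chunk with the full document, which is why
    caching the document portion of the prompt matters for cost.
    """
    contextualized = []
    for chunk in chunks:
        context = situate(document, chunk).strip()
        # Fall back to the bare chunk if no context came back.
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def stub_situate(document: str, chunk: str) -> str:
    """Placeholder for the LLM call: names the document by its first line."""
    lines = document.splitlines()
    title = lines[0] if lines else ""
    return f"From the document '{title}'." if title else ""


doc = "Prompt caching\nThis page explains how cached prefixes are reused."
ready_to_embed = contextualize_chunks(doc, ["It expires after a short time."], stub_situate)
```

The embedding model then sees the context plus the chunk, so a chunk like "It expires after a short time." is no longer starved of what "it" refers to.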

Technique : Earns its complexity when…
Hybrid (dense + BM25) : the query hinges on an exact term/param dense smooths over
HyDE : the query is phrased nothing like the docs (it hurts when the query already matches)
Reranking : the right chunk is retrieved but ranked too low to use
Contextual retrieval : chunks are short/context-starved; you can afford the ingest cost

The detour where I stopped writing a tutorial and started doing infra

I reached for Ragas, the standard RAG eval library. With a Haiku judge, it projected ~11 hours for the full eval matrix. The culprit: Ragas wraps its judge calls in a structured-output layer that retried — and retried — every time the judge's JSON didn't validate, turning each question into an ~8-minute storm of failed calls.

So I read what the four metrics actually compute and built my own async harness. Every judge call returns a single boolean under a trivial schema, so it succeeds on the first try with no retries, and all the calls run concurrently under a semaphore. Same metrics. 221 seconds instead of a projected ~11 hours, which works out to roughly 180× faster.
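The concurrency pattern described above can be sketched in a few lines of asyncio. This is an illustration of the shape, not the harness itself: `fake_judge` is a made-up stand-in for the LLM judge call, and the item strings and concurrency limit are arbitrary.

```python
import asyncio
from typing import Awaitable, Callable


async def run_judgments(
    items: list[str],
    judge: Callable[[str], Awaitable[bool]],
    max_concurrency: int = 8,
) -> list[bool]:
    """Run one boolean judge call per item, at most max_concurrency at a time.

    asyncio.gather preserves input order, so results line up with items.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> bool:
        async with semaphore:
            return await judge(item)

    return list(await asyncio.gather(*(guarded(item) for item in items)))


async def fake_judge(item: str) -> bool:
    """Placeholder for an LLM judge that answers a single yes/no question."""
    await asyncio.sleep(0.01)  # simulate network latency
    return item.endswith("yes")


verdicts = asyncio.run(
    run_judgments(["claim-1:yes", "claim-2:no", "claim-3:yes"], fake_judge, max_concurrency=2)
)
score = sum(verdicts) / len(verdicts)  # share of claims the judge accepted
```

Because each call returns one boolean, a metric becomes a plain average over verdicts, and the semaphore is the only thing standing between the harness and the provider's rate limit.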

That's the line between using a tool and understanding it. When you know your metrics well enough to implement them, you stop being hostage to a black box that's slow for reasons you can't see. Open the box, measure the thing yourself — that instinct is the whole job.

The honest close

None of this is production-ready, and I want to be precise about the gap.

For production I'd build the query router the HyDE backfire demands — classify each query, then pick the retrieval mode instead of applying one blindly. I'd run the eval matrix at real scale (every mode × every category), not the slice I have. The multi-tenant isolation I added (tenant + sensitivity filtering) needs adversarial testing, not just a passing happy path. And plenty is still unmeasured: p99 latency, cost per query, behavior on a corpus I didn't hand-pick.

But the takeaway is one every backend engineer already lives by: these techniques are tools with tradeoffs, and measuring which one helps — for your data, your queries, your failure mode — is the work. The model is new. The discipline isn't.

Top comments (1)

Kartik N V J K • Jun 29

The HyDE result is the cleanest example of this I have seen written down: context precision going 0 to 0.80 on one query and recall collapsing to 0.17 on another is exactly why a single average score hides the truth. What helped me was bucketing queries by type first, since HyDE tends to pay off on vague questions and hurt on ones that already carry the exact keywords. Did you see a pattern in which query types HyDE consistently dragged down?