DEV Community

NARESH

Why Most RAG Systems Hallucinate — And How My Hybrid Pipeline Fixes It


TL;DR

Most RAG systems don't hallucinate because the model is weak.

They hallucinate because retrieval is weak.

In a recent project, I built what looked like a solid RAG pipeline: embeddings, vector database, top-K retrieval, LLM synthesis. It worked beautifully in demos.

Until it didn't.

When I pushed it beyond surface-level queries, subtle cracks appeared:

  • The same idea retrieved five times.
  • Exact keywords silently missed.
  • Shallow answers that sounded confident.
  • Responses generated even when the data wasn't actually there.

Nothing was obviously broken.

But something was structurally wrong.

That's when I realized a hard truth:

If retrieval is narrow, the model will be narrow.

If retrieval is weak, the model will guess.

So I rebuilt the pipeline from the ground up.


Instead of relying solely on vector similarity, I implemented:

  • Hybrid dense + sparse retrieval
  • Reciprocal Rank Fusion for fair ranking
  • Cross-encoder reranking for precision
  • MMR to eliminate context echo
  • A retrieval confidence gate that forces the system to say "I don't know"

The result wasn't just better answers.

It was a system that prioritizes relevance, diversity, and honesty over blind similarity.

Because reliable AI doesn't start at generation.

It starts at retrieval.


A few months ago, in one of my recent projects, I built what I thought was a solid RAG pipeline.

It had all the usual ingredients. A vector database. Embeddings. Top-K retrieval. An LLM synthesizing responses. On paper, it looked impressive. In demos, it sounded impressive too.

Ask it a question and it answered smoothly. Confidently. Almost elegantly.

And honestly? That confidence was the problem.

At first, everything felt magical. You type something in, and the system responds like it has been studying your documents for years. It felt like I had built a research assistant that never sleeps.

But then I started pushing it harder.

Instead of surface-level queries, I asked deeper, more specific questions. Questions that required precision. Questions that needed broader context. Questions where guessing would be dangerous.

That's when I realized something uncomfortable.

Just because a RAG system returns an answer… doesn't mean it truly understands the context it retrieved.

Sometimes it was giving answers that looked correct but felt shallow. Other times it confidently stitched together information that didn't fully represent the bigger picture. It wasn't completely wrong, but it wasn't deeply right either.

And that distinction matters.

Because in real-world systems, "almost correct" is often worse than clearly wrong. Clearly wrong can be fixed. Almost correct can slip through unnoticed.

That moment changed how I approached retrieval.

I stopped thinking of RAG as "vector search + LLM."

Instead, I started seeing it as an engineering problem about information quality, ranking logic, diversity, and uncertainty management.

This blog is about that shift.

It's about how I moved from a basic retrieval setup to a scalable, hybrid, confidence-aware RAG architecture in my recent project and what that journey taught me about why most systems quietly hallucinate without anyone realizing it.

Before we talk about solutions, we need to understand where the cracks begin.


Why Normal RAG Breaks at Scale

On the surface, a basic RAG system feels straightforward.

You take a query.

You convert it into an embedding.

You search for the most similar chunks.

You send the top few to the LLM.

You generate an answer.

Simple. Clean. Elegant.
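The whole naive loop fits in a few lines. Here's a toy sketch of it, using a bag-of-words "embedder" and stopping at the prompt instead of calling a real LLM; the chunks and query are made up for illustration:

```python
from collections import Counter
import math

def embed(text):
    # Toy embedder: bag-of-words counts (a real system would use a neural model)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def naive_rag(query, chunks, k=3):
    # Embed the query, rank chunks by similarity, send the top-k to the LLM
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    context = "\n".join(ranked[:k])
    # A real pipeline would now call an LLM with this prompt
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Tokens are refreshed by the auth service every 15 minutes.",
    "The billing module exports invoices as CSV.",
    "Refresh tokens are rotated on every use.",
]
prompt = naive_rag("How do refresh tokens work", chunks, k=2)
```

Every weakness discussed below hides inside that one innocent-looking `sorted(...)` call.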

And honestly, for small demos or controlled datasets, it works surprisingly well.

But the cracks start appearing the moment your dataset grows or your questions become more nuanced.

Now, I've seen many discussions around this. The common suggestion is: "Just add an agent layer." Or "Use a graph-based RAG." Or "Plug in some advanced orchestration framework." There's nothing wrong with those approaches. They're powerful in the right context.

But here's the thing.

You can stack agents, graphs, chains, and orchestration layers on top of a weak retrieval core, and it's still weak underneath.

This blog isn't about adding more abstraction layers.

It's about strengthening the foundation.

Because the fundamental issue in most RAG systems is this: vector similarity optimizes for closeness, not completeness.

Imagine you ask, "How does our authentication system handle token refresh logic?"

The vector search engine scans embeddings and pulls chunks that are semantically closest. That sounds good. But semantic similarity has a bias: it clusters around dominant ideas.

If your documentation heavily discusses authentication, the top 5 results may all describe the same subsection in slightly different words.

To the retrieval engine, that's success.

To the LLM, that's limited perspective.

It's like assembling a research team and accidentally hiring five specialists from the exact same department. You'll get depth in one direction but not breadth.

Now layer in another issue.

Vector search is excellent at understanding meaning. But it's not great at exact precision. If someone searches for a specific phrase, an acronym, or a configuration key, embeddings may "understand the theme" but miss the literal match that matters.

So now you have two subtle but critical weaknesses:

  • The system retrieves highly similar content instead of diverse content.
  • It sometimes misses exact keyword matches that are essential for precision.

Individually, these seem minor.

At scale, they compound.

And when compounded, they produce the most dangerous type of AI behavior:

Answers that sound correct but are built on incomplete context.

That's when I stopped thinking about adding more AI layers.

Instead, I focused on building a retrieval pipeline strong enough to survive production.

Because if retrieval is weak, no amount of agent magic will save you.


The First Shift: Why I Stopped Relying on Vectors Alone

Once I understood that similarity alone wasn't enough, I had to ask a harder question:

If vector search isn't sufficient by itself… what exactly is missing?

The answer turned out to be balance.

Vector embeddings are brilliant at capturing meaning. If someone searches for "user login security," the system understands that authentication, session management, and tokens are related even if the exact phrase doesn't match. That semantic awareness is powerful.

But it's also fuzzy by design.

And fuzziness is dangerous when precision matters.

For example, if someone searches for a very specific term, say a configuration flag, a library name, or an integration keyword, embeddings might treat it as just another semantic signal. If that term doesn't appear frequently enough, it can get buried under broader conceptual matches.

That's when I realized something simple but important:

Search needs two brains.

One brain that understands meaning.

One brain that respects exact words.

That's when I implemented hybrid retrieval.

Instead of choosing between dense vector search and sparse keyword search, I used both.

  • The dense layer (vector search) handles intent. It understands context, relationships, and semantic closeness.
  • The sparse layer (BM25-style retrieval) handles precision. It respects exact term frequency and inverse document frequency. If a phrase exists verbatim, it surfaces it aggressively.

Think of it like this:

  • Dense search asks, "What is this query about?"
  • Sparse search asks, "Where exactly does this appear?"
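The two brains can run side by side. A minimal sketch, where the scorers are deliberate stand-ins (term-frequency-times-IDF for BM25, character-trigram overlap for embeddings), but the shape is the real one: run both retrievers on the same query and keep two ranked lists:

```python
import math
from collections import Counter

def sparse_scores(query, docs):
    # Simplified BM25-style scoring: term frequency weighted by inverse
    # document frequency (real BM25 also normalizes by document length)
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        s = sum(tf[t] * math.log(1 + n / df[t]) for t in query.lower().split() if t in tf)
        scores.append(s)
    return scores

def dense_scores(query, docs):
    # Stand-in for embedding similarity: character-trigram overlap.
    # A real system would use cosine similarity between neural embeddings.
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    q = grams(query.lower())
    return [len(q & grams(d.lower())) / max(len(q), 1) for d in docs]

def hybrid_ranked_lists(query, docs):
    sp, de = sparse_scores(query, docs), dense_scores(query, docs)
    rank = lambda scores: sorted(range(len(docs)), key=lambda i: -scores[i])
    return rank(sp), rank(de)  # two ranked lists, ready for fusion

docs = [
    "OAuth2 refresh tokens must be rotated after each use.",
    "Our CI pipeline caches Docker layers to speed up builds.",
    "Access token lifetimes are configured via TOKEN_TTL.",
]
sparse_rank, dense_rank = hybrid_ranked_lists("refresh tokens rotation", docs)
```

Notice the output is two orderings, not two scores. That detail is exactly why the next section exists.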

When I combined them, something interesting happened.

The system stopped leaning too heavily in one direction. It stopped overvaluing abstract similarity and started respecting literal relevance as well.

But combining two retrieval systems introduced a new problem.

They speak different languages.

Vector search produces similarity scores based on embedding space. BM25 produces scores based on term statistics. Their scoring scales are completely different. You can't just add them together and hope for the best.

So the next challenge wasn't retrieval.

It was fusion.

And that's where things started getting more interesting.


Merging Two Worlds: Why Ranking Matters More Than You Think

Once I had both dense and sparse retrieval running, I felt confident again.

The system could understand meaning and respect exact terms. That was a big upgrade.

But then a new question appeared.

If both systems return their own top results… how do you combine them?

At first glance, it seems simple. Just merge the lists. Or maybe average the scores.

But here's the problem.

Vector similarity scores and BM25 scores are not comparable. One might range between 0.2 and 0.9. The other might produce values like 8.7 or 15.3 depending on term frequency. They operate on completely different mathematical scales.

Trying to directly combine those numbers is like averaging temperatures measured in Celsius and exam scores out of 100.

It looks scientific. It's not.

That's when I implemented Reciprocal Rank Fusion (RRF).


Instead of trusting the raw scores, RRF trusts position.

It asks a much simpler question:

"How highly did this document rank in each system?"

If a chunk appears near the top in both dense and sparse retrieval, that's a strong signal. If it ranks well in one and poorly in the other, it still gets some credit but less.

Mathematically, RRF assigns each document a blended score by summing, across retrievers:

1 / (k + rank)

where "rank" is the document's position in each list and k is a smoothing constant (60 in the original formulation).
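In code, RRF is almost embarrassingly small. A minimal sketch (the doc IDs and the two input orderings are made up for illustration; k=60 is the constant suggested in the original RRF paper):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: one ordered list of doc IDs per retriever.
    # Each doc earns 1 / (k + rank) from every list it appears in.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # vector-search order
sparse = ["doc_a", "doc_d", "doc_b"]  # BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

Note that the raw similarity and BM25 scores never appear anywhere in the function. Only positions do.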

What I liked about this approach is its humility.

It doesn't pretend the scores are comparable. It only respects ordering.

And ordering is what matters in retrieval.

After applying RRF, the top results felt… balanced.

Documents that were semantically relevant but lacked keyword precision didn't dominate the list. Documents with exact matches but weak context didn't dominate either.

The best of both worlds naturally floated to the top.

But even then, I noticed something.

Even when ranking was balanced, some results still weren't answering the question directly. They were related. Contextually relevant. But not tightly aligned with the exact query phrasing.

That's when I realized:

Independent scoring isn't enough.

The query and document need to be read together.

And that led to the next upgrade.


Reading the Question and the Context Together: Why I Added Cross-Encoder Reranking

After Reciprocal Rank Fusion, the results were cleaner. Balanced. Fair. Much stronger than simple vector search.

But something still bothered me.

The ranking was good but not always precise.

Sometimes the top result was clearly related to the topic, but it didn't directly answer the question being asked. It was like hiring a knowledgeable consultant who understands the industry… but avoids the exact question you asked.

The issue was subtle.

In both dense and sparse retrieval, the query and each document are never read together by a single model. Even in vector search, embeddings are created separately for the query and for each chunk. The similarity score is just a mathematical distance between two vectors.

That's powerful but it's still indirect.

The model never truly "reads" the question and the document together.

So I introduced a second-stage reranking layer using a cross-encoder model.

And this changed everything.

Unlike embedding-based retrieval, a cross-encoder feeds the query and the document into the same transformer at the same time. It doesn't compare two precomputed representations. It processes them jointly.

Think of it like this:

  • Vector search says, "These two pieces of text feel similar."
  • A cross-encoder says, "Let me actually read them together and judge whether this chunk directly answers this question."

That distinction is huge.

After applying cross-encoder reranking to the top 15 results from fusion, I could see the improvement immediately. The chunk that most directly addressed the user's query consistently moved to position #1.
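The reranking stage itself is a thin layer; all the intelligence lives in the scoring model. A sketch with a toy joint scorer standing in; in a real pipeline that callable would be a cross-encoder (for example, sentence-transformers' CrossEncoder class with an MS MARCO model — an illustrative choice, not necessarily the exact model I used):

```python
def rerank(query, candidates, score_fn, top_n=15):
    # Second-stage reranking: score each (query, doc) PAIR jointly,
    # then reorder the fused top-N by that score.
    pool = candidates[:top_n]
    return sorted(pool, key=lambda doc: score_fn(query, doc), reverse=True)

def toy_joint_score(query, doc):
    # Stand-in scorer: fraction of query words the doc contains.
    # A real cross-encoder would read the concatenated (query, doc)
    # pair in a single transformer forward pass instead.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

fused = [
    "Authentication is handled by the gateway service.",
    "The gateway refreshes the access token when it expires.",
    "Logging uses structured JSON output.",
]
top = rerank("when does the gateway refresh the token", fused, toy_joint_score)
```

The design choice worth noticing: reranking only touches the top 15, because joint scoring is expensive and fusion has already done the coarse filtering.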

The system stopped being "generally relevant."

It became specifically relevant.

But here's something interesting.

Even with precise reranking, another issue remained and this one was more subtle than ranking or scoring.

It wasn't about correctness.

It was about diversity.

Because sometimes, even the top-ranked chunks were all saying the same thing in slightly different ways.

And that's where the next breakthrough happened.


Breaking Redundancy: How Maximal Marginal Relevance Changed the Game

Even after hybrid retrieval and cross-encoder reranking, something still felt off.

The top results were relevant. Precisely ranked. Strongly aligned with the query.

But when I looked at the final context being sent to the LLM, I noticed a pattern.

The top five chunks were often different paragraphs… explaining the same idea.

It wasn't wrong.

It was repetitive.

And repetition is dangerous in RAG.

Because the LLM doesn't know you accidentally fed it five versions of the same thought. It just sees five supporting signals and assumes that idea must be extremely important.

That's how answers become narrow without you realizing it.

That's when I discovered Maximal Marginal Relevance (MMR).

At first, I'll be honest, I didn't fully understand it. The name sounds intimidating. It feels academic. But once I broke it down, it turned out to be surprisingly intuitive.

MMR doesn't just ask:

"How relevant is this document to the query?"

It also asks:

"How different is this document from the ones I've already selected?"

That second question is the magic.

Here's how it works conceptually.

First, it selects the most relevant document, an easy choice.

For the next selection, it looks for a chunk that is still highly relevant to the query, but also dissimilar to the chunk already chosen.

It balances two forces:

  • Relevance to the question
  • Novelty compared to selected context

Think of it like curating a panel discussion.

You want experts who understand the topic but you don't want five people who all share the exact same viewpoint.
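Here's a compact sketch of that selection loop. Jaccard word overlap stands in for embedding similarity, and the lambda weight trades relevance against novelty (the documents and weights are illustrative):

```python
def mmr_select(query_sim, doc_sim, docs, k=3, lam=0.7):
    # Maximal Marginal Relevance: each pick balances relevance to the
    # query (query_sim) against similarity to already-selected docs (doc_sim).
    # lam=1.0 is pure relevance; lam=0.0 is pure diversity.
    selected = []
    remaining = list(docs)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda d: lam * query_sim(d)
            - (1 - lam) * max((doc_sim(d, s) for s in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

def jaccard(a, b):
    # Toy similarity: word-set overlap (a real system would compare embeddings)
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0

docs = [
    "the token refresh interval is fifteen minutes",
    "the token refresh interval is 15 minutes for every client",  # near-duplicate
    "a failed refresh forces the user to log in again",
]
query = "token refresh policy"
picked = mmr_select(lambda d: jaccard(query, d), jaccard, docs, k=2, lam=0.5)
```

With plain relevance ranking, the near-duplicate would take the second slot; MMR penalizes it for echoing the first pick and promotes the failure-handling chunk instead.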

When I implemented MMR after reranking, the difference was visible immediately.

Instead of five similar paragraphs reinforcing the same section, the LLM received a broader slice of the system. Different angles. Different components. Different layers.

And suddenly, the answers became more complete.

More grounded.

More balanced.

It wasn't just about avoiding repetition.

It was about giving the model room to reason across perspectives.

And this was the moment I realized something important.

Most hallucinations aren't caused by lack of intelligence.

They're caused by lack of diversity in context.

But even with hybrid retrieval, fusion, reranking, and MMR… one final problem remained.

What happens when the database simply doesn't contain the answer?

That's where the most important safeguard comes in.


When the System Should Stay Silent: Retrieval Confidence & Hallucination Guards


Up to this point, the pipeline was strong.

  • Hybrid retrieval gave balance.
  • RRF gave fairness.
  • Cross-encoder reranking gave precision.
  • MMR gave diversity.

The answers were dramatically better.

But there was still one uncomfortable scenario I had to confront.

What happens when the answer simply isn't in the database?

In a basic RAG setup, this is where things get dangerous.

The system still retrieves the "top 5" chunks even if those chunks barely match the query. They might be loosely related. They might share one keyword. But they're not real answers.

And the LLM, being trained to be helpful, tries to construct something anyway.

It doesn't want to disappoint.

So it guesses.

Not maliciously. Not randomly. Just probabilistically.

That's how hallucinations sneak in: not because the model is broken, but because retrieval passed weak evidence with full confidence.

That's when I realized something critical:

A serious RAG system must measure its own certainty.

So I added a retrieval confidence layer.

After the cross-encoder reranks the top results, I calculate the average relevance score of the top three chunks. Since the reranker outputs scores between 0 and 1, this gives a clean signal of how strongly the retrieved context aligns with the query.

If that average falls below a threshold (0.4, in my case), the system does something most AI systems rarely do.

It refuses.

Not aggressively. Not dramatically.

Just calmly.

Instead of sending weak context to the LLM, it responds with a graceful message saying there isn't enough verified information to answer confidently.
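The gate itself is only a few lines. A sketch with made-up chunks and scores, assuming the reranker's scores land in [0, 1] as described above:

```python
REFUSAL = "I don't have enough verified information to answer that confidently."

def confidence_gate(reranked, threshold=0.4, top_n=3):
    # reranked: list of (chunk, score) pairs from the cross-encoder,
    # best first, scores in [0, 1]. Gate on the mean of the top scores.
    top = [score for _, score in reranked[:top_n]]
    confidence = sum(top) / len(top) if top else 0.0
    if confidence < threshold:
        return None, confidence  # caller responds with REFUSAL instead
    return [chunk for chunk, _ in reranked[:top_n]], confidence

context, conf = confidence_gate(
    [("chunk about auth", 0.82), ("token docs", 0.74), ("session notes", 0.51)]
)
weak, weak_conf = confidence_gate(
    [("vaguely related", 0.31), ("one shared keyword", 0.22)]
)
```

The key property: the gate runs before generation, so weak evidence never reaches the LLM at all.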

And this changed the trust dynamics completely.

Because now the system isn't just optimized for answering.

It's optimized for answering responsibly.

In production systems, trust is more valuable than cleverness.

A model that occasionally says "I don't know" is far more powerful than one that always pretends it does.

And at this point, the retrieval layer wasn't just strong.

It was honest.

But building a reliable system isn't only about correctness.

It's also about efficiency.

Because even the smartest pipeline becomes impractical if it's slow or expensive.

And that's where optimization came in.


Making It Fast and Lean: Optimization, Compression, and Caching

Once the retrieval pipeline became reliable, a new reality set in.

Strong pipelines are expensive.

  • Hybrid retrieval means two searches.
  • Reranking means another model call.
  • MMR means additional computation.
  • Confidence checks add orchestration logic.

All of that improves quality, but it also increases latency and token usage.

And in production, latency and token cost are not small details. They are architectural constraints.

So I had to optimize.

The first improvement came from payload design.

When fetching structured data from tools like databases or repositories, the raw JSON responses were verbose. Repeated keys. Deep nesting. Redundant structures. Sending that directly to the LLM would waste tokens on formatting instead of reasoning.

So I introduced a lightweight compression wrapper that restructures tool outputs into a minimal, structured format. Same information. Fewer repeated tokens. Cleaner context.

Think of it like summarizing a spreadsheet before handing it to an analyst. You remove the noise, keep the signal.

This significantly reduced token consumption without sacrificing clarity.
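A sketch of the idea: flatten nested tool output into compact path=value lines, dropping the braces, quotes, and repeated keys that waste context space (the structure here is illustrative, not the exact wrapper I shipped):

```python
def compress_payload(obj, prefix=""):
    # Flatten nested JSON-like data into "path=value" lines so the LLM
    # spends tokens on content, not on structural punctuation.
    lines = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            lines += compress_payload(v, f"{prefix}.{k}" if prefix else k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            lines += compress_payload(v, f"{prefix}[{i}]")
    else:
        lines.append(f"{prefix}={obj}")
    return lines

raw = {"repo": {"name": "auth-service", "issues": [{"id": 7, "state": "open"}]}}
compact = "\n".join(compress_payload(raw))
```

The same record that arrived as nested JSON becomes three plain lines like `repo.issues[0].state=open`: identical information, far fewer structural tokens.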

The second optimization was caching.

In real-world usage, users often ask similar questions repeatedly. If a response has already been generated confidently, there's no reason to recompute the entire retrieval pipeline every time.

So I added multi-layer caching.

  • High-confidence LLM responses get cached.
  • Tool responses get cached.
  • Retrieval steps can be cached when appropriate.
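A minimal sketch of the answer-layer cache: key entries on a hash of the normalized query, and refuse to store anything below the confidence threshold (the class and threshold wiring here are illustrative):

```python
import hashlib

class AnswerCache:
    # Sketch of one cache layer: confident LLM answers keyed by query hash.
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(query):
        # Normalize whitespace and case so trivial rephrasings hit the cache
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def put(self, query, answer, confidence, threshold=0.4):
        if confidence >= threshold:  # never cache low-confidence output
            self._store[self._key(query)] = answer

cache = AnswerCache()
cache.put("How does token refresh work?", "Tokens rotate every 15 minutes.", confidence=0.8)
cache.put("What is the meaning of life?", "Probably 42.", confidence=0.2)
```

Gating the cache on confidence matters: a cached hallucination is worse than a slow answer, because it gets replayed forever.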

The result?

Repeated queries resolve in milliseconds.

Which means the system isn't just accurate.

It's responsive.

And that's when the pipeline finally felt complete.

Not just smart.

Not just safe.

But scalable.

At this stage, I stepped back and looked at what had evolved.

What started as "vector search + LLM" had turned into a layered retrieval architecture with:

  • Dual retrieval brains
  • Fair ranking fusion
  • Precision reranking
  • Diversity enforcement
  • Confidence-based refusal
  • Token-efficient payload design
  • Intelligent caching

And the difference in answer quality was not incremental.

It was structural.


Lessons Learned: What Building a Serious RAG System Taught Me

When I started, I thought RAG was mostly about embeddings.

Generate vectors.

Store them.

Retrieve top results.

Send to LLM.

Done.

But building a serious, scalable pipeline changed how I think about AI systems entirely.

The biggest lesson?

Retrieval is not a feature.

It's a responsibility.

LLMs are incredibly capable, but they are not fact-checkers. They are pattern synthesizers. If you give them narrow context, they produce narrow answers. If you give them weak evidence, they fill in the gaps. And if you give them repetitive context, they amplify it.

The model is only as grounded as the retrieval layer beneath it.

Another lesson was this:

Relevance alone is not enough.

You need balance between semantic understanding and literal precision. You need ranking logic that respects both. You need diversity in context so the model can reason across perspectives. And most importantly, you need a mechanism that knows when to stop.

Because sometimes the most intelligent response a system can give is:

"I don't know."

And strangely, adding that constraint made the system stronger, not weaker.

The final realization was architectural.

You can stack agents, tools, orchestration layers, and complex workflows on top of RAG. But if the retrieval foundation is weak, everything built on top will wobble.

A strong AI system isn't defined by how many components it has.

It's defined by how intentionally those components interact.

Building this pipeline in my recent project forced me to move beyond "it works" and into "it works reliably under pressure." And that shift from experimentation to production thinking was the real evolution.

This isn't the final form of RAG. Retrieval will keep evolving. Adaptive pipelines, feedback loops, dynamic context windows: there's a lot ahead.

But one thing is clear.

If we want AI systems that are trustworthy, scalable, and responsible, we have to engineer retrieval with the same seriousness we engineer models.

Because the quality of answers doesn't begin at generation.

It begins at retrieval.


🔗 Connect with Me

📖 Blog by Naresh B. A.

👨‍💻 Building AI & ML Systems | Backend-Focused Full Stack

🌐 Portfolio: Naresh B A

📫 Let's connect on [LinkedIn] | GitHub: [Naresh B A]

Thanks for spending your precious time reading this. It's my personal take on a tech topic, and I really appreciate you being here. ❤️
