RAGPrep
80% of RAG Failures Start Here (And It's Not the LLM)

A team spent three weeks debugging hallucinations in their RAG system. They tried different prompts. They swapped embedding models. They tuned retrieval parameters.

The LLM wasn't the problem.
The retriever wasn't the problem.
The chunks were the problem.

The setup

Fixed-size chunking, 512 tokens, 10% overlap. Standard configuration. Nothing obviously wrong at first glance.

But when we scored the chunks:

  • 34% had completeness scores below 0.4
  • 28% were orphan chunks — fragments with no surrounding context
  • 19% duplicated information already in adjacent chunks

12,000 embeddings in their vector database.
4,000 of them were low quality.
They were paying to store, retrieve, and feed garbage to their LLM.

The specific failure

A user asks: "What's the load capacity of the X400?"

The retriever returns the 3 most semantically similar chunks:

  • "The X400 is designed for industrial use..."
  • "Load capacity specifications vary by model..."
  • "See table 4 for complete specifications..."

Table 4 had been split across 3 chunks during ingestion, each missing the context to be useful. The LLM received three fragments that pointed to an answer without containing one. It hallucinated.

Why this happens

Most chunking strategies optimise for speed and simplicity, not quality. Fixed-size chunking splits documents at token boundaries with no awareness of semantic content.
A sentence that starts in one chunk and ends in another produces two orphan fragments, each useless in isolation.
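You can see the failure mode with a toy splitter. This is a minimal sketch, not any library's actual chunker: it splits on words instead of tokens (a hypothetical chunk size of 12 words stands in for 512 tokens), and the X400 sentences are made up for illustration.

```python
# Toy fixed-size chunker: splits at a fixed word count with no awareness
# of sentence boundaries, mimicking naive token-window chunking.
text = (
    "The X400 is designed for industrial use in harsh environments. "
    "Its load capacity is 400 kg when mounted on a reinforced rail."
)
words = text.split()
chunk_size = 12  # stand-in for a 512-token window
chunks = [
    " ".join(words[i:i + chunk_size])
    for i in range(0, len(words), chunk_size)
]
for c in chunks:
    print(repr(c))
```

The window boundary lands in the middle of "load capacity is 400 kg": chunk 1 ends with "Its load" and chunk 2 begins with "capacity is 400 kg". Neither fragment can answer "What's the load capacity of the X400?" on its own, yet both will embed and retrieve just fine.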

The problem compounds at scale. In a demo with 50 hand-picked documents, you never hit these edge cases.
In production with 50,000 documents from multiple sources, they're everywhere.

The fix

Score your chunks before you embed them.

Specifically, check:

  1. Completeness — does the chunk contain a complete thought?
  2. Semantic density — what ratio of the chunk is meaningful signal vs boilerplate?
  3. Context sufficiency — could this chunk answer a question on its own?

Chunks that fail these checks should be merged, re-chunked, or filtered before they hit your vector database.
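The three checks can be approximated with cheap heuristics before you reach for anything heavier. The sketch below is illustrative only; the function name, stopword list, and dangling-reference patterns are my own assumptions, not a specific tool's API.

```python
import re

# Boilerplate-ish words used for a rough semantic-density estimate (assumption).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "for", "and", "or", "see"}

# Phrases that point at context the chunk doesn't contain (assumption).
DANGLING = re.compile(r"\b(see table|see figure|as above|the following)\b")

def score_chunk(chunk: str) -> dict:
    """Heuristic pre-embedding quality checks for a single chunk."""
    text = chunk.strip()
    words = text.split()
    # 1. Completeness: starts like a sentence, ends at sentence punctuation.
    complete = bool(text) and text[0].isupper() and text[-1] in ".!?"
    # 2. Semantic density: share of words that aren't stopword filler.
    content = [w for w in words if w.lower().strip(".,;:") not in STOPWORDS]
    density = len(content) / max(len(words), 1)
    # 3. Context sufficiency: flag references the chunk can't resolve itself.
    dangling = bool(DANGLING.search(text.lower()))
    return {"complete": complete, "density": round(density, 2),
            "dangling_reference": dangling}
```

Run it over the fragments from the X400 example and the third one gets flagged immediately: `score_chunk("See table 4 for complete specifications...")` reports `dangling_reference: True`, which is exactly the kind of chunk to merge with its referenced table or filter out before embedding.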

A 2025 study on RAG systems found that optimising chunk quality improved faithfulness scores from 0.47 to 0.82 — a 74% improvement. The embedding model didn't change.
The retriever didn't change. Only the chunk quality changed.

The problem is almost always upstream of where you're looking.


I built ChunkScore to solve this problem — free chunk quality auditor, no signup required.
Works on chunks from LangChain, LlamaIndex, Chonkie, or any JSON array.
