I built a RAG system for a customer support knowledge base. It retrieved relevant documentation chunks and used them to answer questions. Retrieval accuracy was ninety-six percent. Answer accuracy was thirty-two percent.
The retrieval worked perfectly. The answers were completely wrong.
The Setup
Enterprise software company with eight hundred pages of technical documentation. They wanted an AI that could answer customer questions using this knowledge base instead of forcing customers to search manually.
Standard RAG architecture: a customer asks a question, the system embeds the query, searches the vector database for the most relevant chunks, and feeds those chunks to the LLM along with the question; the LLM generates an answer using the retrieved context.
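In rough Python terms, the whole pipeline fits in one function. The names embed, search, and complete below are placeholders for whatever embedding model, vector store, and LLM client you plug in, and the prompt wording is illustrative, not the production prompt:

```python
from typing import Callable, List

def answer_question(
    question: str,
    embed: Callable[[str], List[float]],          # embedding model
    search: Callable[[List[float], int], List[str]],  # vector store lookup
    complete: Callable[[str], str],               # LLM client
    top_k: int = 4,
) -> str:
    """Retrieve-then-generate: embed the query, fetch the nearest chunks,
    and ask the LLM to answer using only that context."""
    query_vector = embed(question)
    chunks = search(query_vector, top_k)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the customer question using only the documentation below.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Question: {question}\n"
        "If the documentation does not contain the answer, say so."
    )
    return complete(prompt)
```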
I tested retrieval quality first. For one hundred sample questions, the system retrieved the correct documentation sections ninety-six times. Nearly perfect retrieval.
Then I tested end-to-end answers. Out of the same one hundred questions, only thirty-two answers were actually correct or helpful. The rest were wrong, incomplete, or misleading.
The retrieval was flawless. The answer generation was broken.
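The measurement itself was a simple loop over the question set, scoring the two stages separately. This sketch assumes a retrieve function that returns the IDs of the retrieved sections and a correctness check that, in practice, was a manual judgment against the docs:

```python
def evaluate(questions, expected_sections, retrieve, generate_answer, is_correct):
    """Score retrieval hits and end-to-end answer quality over the same questions.
    retrieve(q) returns retrieved section IDs; is_correct(q, a) is the
    correctness judgment (here, a manual check against the documentation)."""
    retrieval_hits = answer_hits = 0
    for question, expected_section in zip(questions, expected_sections):
        if expected_section in retrieve(question):
            retrieval_hits += 1
        if is_correct(question, generate_answer(question)):
            answer_hits += 1
    total = len(questions)
    return retrieval_hits / total, answer_hits / total
```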
Why This Happened
The problem was not retrieval. The problem was chunk boundaries. The system was retrieving the right paragraphs but those paragraphs did not contain complete information when isolated from surrounding context.
Example question: "How do I reset my API key?"
Retrieved chunk: "Click the regenerate button and confirm. Your old key will stop working immediately."
This chunk is relevant. It mentions API key regeneration. But it is missing critical information. Where is the regenerate button? What menu? What happens to existing API calls? How do I update my code?
That information existed in the documentation, but it was in the paragraph before and the paragraph after the retrieved chunk. The chunking strategy had split one complete procedure into three separate chunks. The retrieval system grabbed the middle chunk and missed the setup and follow-up steps.
The LLM saw incomplete instructions and either filled in the gaps with hallucinations or gave vague unhelpful answers.
The Chunking Problem
The documentation was chunked by fixed size. Every five hundred tokens became one chunk. Clean. Consistent. Terrible for meaning preservation.
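The splitter amounted to something like this, with a naive whitespace split standing in for real tokenization:

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Cut the documentation into chunks of roughly chunk_size tokens,
    with no regard for headings, procedures, tables, or lists."""
    tokens = text.split()  # whitespace split standing in for a real tokenizer
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```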
A procedure that said "First go to Settings. Then navigate to API section. Click regenerate button and confirm" got split into two chunks if it crossed the five-hundred-token boundary. Retrieval might grab the second chunk, which starts mid-procedure.
Tables were even worse. A pricing table got split horizontally. The retrieved chunk had row data without column headers. The LLM could not interpret what the numbers meant.
Lists broke across chunks. A troubleshooting guide with eight steps got split. The user got steps four through six without context of what came before.
The Failed Fix
I tried increasing chunk size to one thousand tokens. That reduced splitting but created a new problem. Chunks became too broad. A chunk about API keys now also included information about OAuth, webhooks, and rate limiting. Retrieval precision dropped because chunks were less focused.
I tried overlapping chunks. Each chunk included the last one hundred tokens of the previous chunk. That helped slightly but created massive redundancy. The vector database size tripled and search became slower.
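The overlap variant is the same splitter with a smaller stride, which is exactly where the redundancy comes from. A minimal sketch:

```python
def overlapping_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size chunks where each chunk repeats the last `overlap` tokens of
    the previous one: a bit more context at the seams, at the cost of storing
    the same text multiple times."""
    tokens = text.split()  # again, a stand-in for real tokenization
    step = chunk_size - overlap
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), step)
    ]
```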
The Real Solution Was Semantic Chunking
The breakthrough was abandoning fixed-size chunks entirely. Instead, I chunked by semantic boundaries. Procedures stayed together. Tables stayed whole. Lists remained intact.
The new chunking logic identified content types. A procedure section with numbered steps became one chunk regardless of length. A table became one chunk with headers and all rows. A conceptual explanation paragraph became one chunk.
If a section was genuinely too large, it split at natural breakpoints. After a procedure ends but before the next topic begins. After a table but before explanatory text. At heading boundaries, not mid-paragraph.
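A stripped-down version of that logic, assuming the docs can be walked as Markdown, splits only at heading lines; the content-type rules above (keep procedures, tables, and lists whole) sit on top of this in the real pipeline:

```python
import re

HEADING = re.compile(r"^#{1,4}\s+")  # Markdown-style heading line

def semantic_chunks(doc: str) -> list[str]:
    """Split at heading boundaries so each chunk is a whole section.
    Further splitting of oversized sections would happen at natural
    breakpoints (end of a procedure, after a table), never mid-paragraph."""
    chunks: list[list[str]] = []
    for line in doc.splitlines():
        if HEADING.match(line) or not chunks:
            chunks.append([])  # a new heading starts a new chunk
        chunks[-1].append(line)
    return ["\n".join(c).strip() for c in chunks if "\n".join(c).strip()]
```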
I also added contextual metadata to each chunk. Every chunk now includes the page title, section heading, and subsection heading it came from. When the chunk is retrieved, the LLM sees not just the paragraph but also where it sits in the documentation hierarchy.
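Attaching that hierarchy can be as simple as prefixing it to the chunk text before embedding and storage; the field names here are illustrative:

```python
def add_hierarchy(chunk_text: str, page_title: str, section: str, subsection: str = "") -> str:
    """Prefix each chunk with where it lives in the docs, so the retrieved
    text carries its own context into the prompt."""
    path = f"Page: {page_title} > Section: {section}"
    if subsection:
        path += f" > Subsection: {subsection}"
    return f"{path}\n\n{chunk_text}"
```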
What Changed
Question: "How do I reset my API key?"
Old retrieval: Middle paragraph of procedure. "Click the regenerate button and confirm."
New retrieval: Complete procedure with context.
"Section: API Management > Subsection: Key Regeneration

To reset your API key:
1. Navigate to Settings > API Keys
2. Locate your current key in the list
3. Click the regenerate button next to it
4. Confirm the action in the popup
5. Copy your new key immediately

Note: Your old key stops working immediately upon regeneration."
The LLM now had complete instructions with all necessary steps and context about where to find the feature.
The Results
After switching to semantic chunking, I retested the same one hundred questions. Retrieval accuracy dipped slightly to ninety-four percent because some chunks were now more specific. But answer accuracy jumped to eighty-seven percent.
The system went from retrieving right and answering wrong to both retrieving right and answering right. Customer satisfaction with AI answers went from thirty-eight percent to eighty-one percent.
Support ticket deflection, the real business metric, increased from eighteen percent to sixty-three percent. The AI was finally reducing support load instead of frustrating customers with incomplete answers.
What I Learned
Retrieval accuracy is not the same as answer quality. You can retrieve the exactly correct paragraph and still generate a wrong answer if that paragraph lacks surrounding context.
Fixed-size chunking optimizes for engineering simplicity, not semantic coherence. Real documentation has structure. Procedures have steps. Tables have relationships. Lists have order. Chunking must preserve these structures.
Contextual metadata matters as much as content. Knowing a chunk comes from the API Management section under Key Regeneration helps the LLM understand what it is reading and how it relates to the question.
The Bottom Line
A RAG system with ninety-six percent retrieval accuracy produced wrong answers sixty-eight percent of the time because chunks were split at arbitrary boundaries that destroyed semantic meaning. The fix was chunking by content structure instead of token count and adding hierarchical context to every chunk.
Written by Farhan Habib Faraz
Senior Prompt Engineer building conversational AI and voice agents
Tags: rag, chunking, retrieval, knowledgebase, vectorsearch, semanticsegmentation