We measured baseline accuracy across a production RAG system: 62%.
Six weeks later, after five architecture changes and zero model changes: 94%.
Here's exactly what we changed, why each one mattered, and what the numbers looked like before and after each step.
The Setup
Internal knowledge assistant for a mid-market company. Knowledge spread across Confluence, Google Drive, and SharePoint, approximately 4,200 documents total. Users asking natural language questions about internal policies, processes, and product specifications.
Stack at baseline:
| Component | Details |
|---|---|
| LLM | GPT-4o |
| Embedding model | text-embedding-3-large |
| Vector store | Pinecone |
| Chunking | RecursiveCharacterTextSplitter, 1024 tokens, 20% overlap |
| Retrieval | top-8 by cosine similarity, no re-ranking |
| Eval | none |
This worked well in testing. In production, it hit a ceiling at week three — when real users arrived with queries that went beyond the clean, structured examples we'd tested against.
The first sign was a CTO call. A VP had used the system for a cross-departmental policy question, gotten a confident and wrong answer, and forwarded it to three people. The system had retrieved from one policy document and missed a contradicting clause in another.
Before touching anything, we measured.
Evaluation First — Before Any Changes
150 real queries pulled from production logs. Not test cases we designed: actual user queries from the logs, stratified across query types.
Reference answers written by domain experts. Automated grading via RAGAS across four dimensions: faithfulness, answer relevance, context precision, context recall. Manual spot-check on the ~15% of answers that fell in scoring edge cases.
Baseline results:
| Metric | Score |
|---|---|
| Overall accuracy | 62% |
| Multi-document queries | 41% |
| Exact-match retrieval | 58% |
| False confidence rate* | 68% |
*Proportion of wrong answers with no hedging language — the model sounded equally certain whether it was right or wrong.
Without this baseline, none of the following changes would have been verifiable. Build the eval suite first. Always.
Change 1: Semantic Chunking
Impact: 🔴 High
Fixed-window chunking at 1024 tokens truncates documents at arbitrary boundaries. For policy documents, contracts, and multi-page specs, this severs logical relationships between sections: a conditional clause in section 3 that's resolved by a definition in section 7 ends up in two different chunks with no retrieval path connecting them.
We moved to semantic chunking: split on natural topic/section boundaries detected via sentence-level embedding similarity. Consecutive sentences are grouped while their cosine similarity stays above a threshold; a new chunk starts when similarity drops below it.
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_chunk(sentences, embed_fn, threshold=0.75):
    """
    Split sentences into semantically coherent chunks.

    Args:
        sentences: list of sentence strings
        embed_fn: function that returns an embedding for a string
        threshold: minimum similarity to stay in the same chunk

    Returns:
        list of chunk strings
    """
    if not sentences:
        return []
    # Embed each sentence once up front instead of re-embedding
    # both neighbors on every loop iteration
    embeddings = [np.array(embed_fn(s)).reshape(1, -1) for s in sentences]
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])[0][0]
        if similarity >= threshold:
            current_chunk.append(sentences[i])
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
    chunks.append(" ".join(current_chunk))
    return chunks
```
If you're using LlamaIndex: SemanticSplitterNodeParser with breakpoint_percentile_threshold=85 is a solid starting point. Tune the threshold per document type — policy and legal docs generally benefit from higher thresholds (split less often).
Tradeoffs to expect:
- Preprocessing is significantly heavier than fixed-window
- Variable chunk lengths complicate token budget management in your context window
- Ingestion pipeline needs updating if you have document update automation
For read-heavy internal assistants with low update frequency, the accuracy gain outweighs these costs.
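The variable-length tradeoff is manageable with greedy packing at assembly time. A minimal sketch — the ~4 characters/token estimate and the 6,000-token budget are illustrative assumptions, not our production values; use a real tokenizer (e.g. tiktoken) in practice:

```python
def pack_chunks(chunks, token_budget=6000):
    """Greedily select chunks in rank order until the budget is exhausted.

    chunks: ranked list of chunk strings (best first).
    Uses a rough ~4 characters/token estimate.
    """
    packed, used = [], 0
    for chunk in chunks:
        est_tokens = len(chunk) // 4
        if used + est_tokens > token_budget:
            continue  # skip an oversized chunk, keep trying smaller ones
        packed.append(chunk)
        used += est_tokens
    return packed
```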
Accuracy improvement on multi-hop queries after this change alone: +31% relative.
Change 2: Hybrid Search with Reciprocal Rank Fusion
Impact: 🔴 High
Semantic/vector search is excellent at concept-level similarity. It's weak at exact matching — product codes, regulation identifiers, version strings, internal project names, anything where a user is looking for a specific term rather than a concept.
We added a BM25 keyword index running parallel to the Pinecone vector index. Retrieved top-20 candidates from each independently, then fused them with Reciprocal Rank Fusion (RRF):
```python
def reciprocal_rank_fusion(
    vector_results: list[str],
    bm25_results: list[str],
    k: int = 60
) -> list[str]:
    """
    Merge two ranked lists using Reciprocal Rank Fusion.

    Args:
        vector_results: doc IDs ranked by vector similarity
        bm25_results: doc IDs ranked by BM25 keyword score
        k: smoothing constant (60 is standard, rarely needs tuning)

    Returns:
        Merged list of doc IDs sorted by fused score (best first)
    """
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
```
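To see why fusion rewards cross-list agreement, here's the arithmetic on a toy case: a document ranked moderately in both lists beats a document ranked first in only one (ranks are 0-based):

```python
k = 60

def rrf_score(ranks):
    # ranks: 0-based positions of one doc across the lists it appears in
    return sum(1 / (k + r + 1) for r in ranks)

doc_in_both = rrf_score([2, 0])  # 3rd in vector results, 1st in BM25
doc_in_one = rrf_score([0])      # 1st in vector results, absent from BM25
print(doc_in_both > doc_in_one)  # True: agreement across lists dominates
```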
If you're using LangChain: EnsembleRetriever handles this cleanly. Set weights=[0.4, 0.6] for BM25 and vector respectively as a starting point, and adjust based on how many exact-match queries your workload has. The weights matter less than you'd think if you're re-ranking afterward.
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 20

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)
```
If your knowledge base contains any structured identifiers, and it almost certainly does, vector-only retrieval is not sufficient. This was the change that fixed the most visible failure class in our case.
Change 3: Cross-Encoder Re-ranking
Impact: 🟡 Medium-High
Vector retrieval and BM25 give you candidate chunks. They rank candidates by how similar they are to the query embedding or how well they match keywords. Neither metric answers the question that actually matters: how useful is this specific chunk for answering this specific question?
A cross-encoder re-ranker scores each chunk-query pair jointly; it reads both together, not independently, which produces relevance scores much more aligned with what the LLM actually needs.
We used ms-marco-MiniLM-L-6-v2 from sentence-transformers: small enough to run without meaningful infrastructure cost, effective enough to make a real difference.
```python
from sentence_transformers import CrossEncoder
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[Chunk], top_k: int = 5) -> list[Chunk]:
    """
    Re-rank candidate chunks by relevance to the query.

    Args:
        query: the user's question
        candidates: retrieved chunks from vector + BM25 retrieval
        top_k: how many to pass to the LLM context window

    Returns:
        Top-k chunks ranked by cross-encoder relevance score
    """
    pairs = [(query, chunk.text) for chunk in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```
If you're using LangChain:
```python
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=model, top_n=5)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever  # from Change 2
)
```
Latency cost: ~200–400ms per query on CPU inference with MiniLM. For an internal assistant where users expect 3–4s response times, this is acceptable. For customer-facing chatbots with sub-1s SLAs, benchmark first and consider skipping re-ranking on queries classified as simple (single-document, low complexity).
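That "skip re-ranking on simple queries" gate can be a cheap heuristic in front of the cross-encoder. A sketch — the marker list and length cutoff here are illustrative, not tuned values:

```python
import re

# Language that suggests a multi-entity or comparative query
COMPLEX_MARKERS = re.compile(
    r"\b(compare|versus|vs\.?|difference|both|across|between)\b",
    re.IGNORECASE,
)

def needs_reranking(query: str, max_simple_tokens: int = 12) -> bool:
    """Heuristic gate: short queries without comparative language are
    likely single-document lookups that can skip the cross-encoder."""
    if COMPLEX_MARKERS.search(query):
        return True
    return len(query.split()) > max_simple_tokens
```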
Change 4: Source Hierarchy + Metadata Tagging
Impact: 🟡 Medium
When Confluence has a "remote work policy" document updated in March, and SharePoint has a different version updated in January, and both are in your knowledge base — which one does the retriever trust?
Without source hierarchy metadata, the LLM gets both chunks, infers authority from the text, and guesses. It guesses wrong often enough to matter.
We added three metadata fields (source_authority, last_updated, domain) to every document on ingestion:
```python
from langchain.schema import Document
from datetime import datetime

def create_document_with_metadata(
    text: str,
    source: str,
    source_authority: int,  # 1=primary, 2=secondary, 3=supplementary
    last_updated: datetime,
    domain: str
) -> Document:
    """
    Wrap chunk text with authority metadata for conflict resolution.

    source_authority levels:
        1 = Primary (HR portal, official policy docs, product spec sheets)
        2 = Secondary (team wikis, internal guides, how-to docs)
        3 = Supplementary (Slack archives, meeting notes, draft docs)
    """
    return Document(
        page_content=text,
        metadata={
            "source": source,
            "source_authority": source_authority,
            "last_updated": last_updated.isoformat(),
            "domain": domain
        }
    )

# Retrieval with authority filter — prefer primary sources
vector_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 20,
        "filter": {"source_authority": {"$lte": 2}}  # primary + secondary only
    }
)
```
The heuristic is simple: when two sources conflict on the same topic, source_authority=1 wins. This eliminates the most embarrassing failure class — confident wrong answers sourced from an outdated secondary document when the primary existed and said something different.
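When retrieval still surfaces conflicting chunks on the same topic, the same tie-break can be applied at context-assembly time. A sketch, assuming chunks carry the source_authority and ISO-8601 last_updated metadata fields from the ingestion code above:

```python
def resolve_conflicts(chunks):
    """Order chunks so authoritative, recently-updated sources come first.

    chunks: list of dicts whose "metadata" contains "source_authority"
    (1 = most authoritative) and "last_updated" (ISO 8601 string).
    """
    # Sort newest-first, then stable-sort by authority so authority wins
    # and recency breaks ties within the same authority level
    by_recency = sorted(
        chunks, key=lambda c: c["metadata"]["last_updated"], reverse=True
    )
    return sorted(by_recency, key=lambda c: c["metadata"]["source_authority"])
```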
This change required no model updates and no prompt changes. It's a data pipeline discipline issue, not an AI problem.
Change 5: Structured Evaluation Suite (The One Most Teams Skip)
Impact: 🔴 Foundational
This is the process change, not the architecture change. And it's the one that made every other change above verifiable rather than speculative.
What we built:
- 150 representative queries sampled from production logs, not cherry-picked happy-path examples, not queries we designed. Stratified across query types: single-document, multi-document, exact-match, comparative, conditional.
- Reference answers written by domain experts for each query: the ground truth the system should produce.
- Automated RAGAS grading across faithfulness, answer relevance, context precision, and context recall.
- Manual spot-check on ~15% of borderline-scored answers.
- CI gate: eval suite runs on every deployment. A drop of >3% on any category blocks the deployment.
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

def run_eval(questions, answers, contexts, ground_truths):
    """
    Run RAGAS evaluation against reference answers.
    Returns a RAGAS result with per-metric scores.
    """
    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,  # list of lists — retrieved chunks per question
        "ground_truth": ground_truths
    }
    dataset = Dataset.from_dict(data)
    result = evaluate(
        dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall
        ]
    )
    return result
```
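The CI gate itself is a few lines on top of the eval output. A sketch, assuming per-category scores are available as plain dicts (the 3-point threshold mirrors the rule above):

```python
def check_regression(baseline: dict, current: dict, max_drop: float = 0.03):
    """Return the categories that regressed by more than max_drop.

    baseline/current: {"category": score in [0, 1]} from the eval run.
    An empty result means the deployment may proceed.
    """
    return [
        category
        for category, base_score in baseline.items()
        if current.get(category, 0.0) < base_score - max_drop
    ]
```

Wire the return value into the deploy job: a non-empty list fails the build.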
Without this eval pipeline, you're flying blind. Developers eyeball outputs and call it "seems good." Users report the failures that are loud enough to surface, a heavily biased sample of what's actually failing. The 38% failing quietly never gets fixed.
Final Results
All five changes applied, same LLM throughout:
| Metric | Baseline | After | Delta |
|---|---|---|---|
| Overall accuracy | 62% | 94% | +32pp |
| Multi-document reasoning | 41% | 87% | +46pp |
| Exact-match retrieval | 58% | 96% | +38pp |
| False confidence rate | 68% | 12% | -56pp |
The false confidence number is the one I find most significant. 68% of wrong answers originally had zero hedging language — the model sounded equally certain whether it was right or wrong. After fixing the retrieval layer (without touching the model or prompt), that rate dropped to 12%.
Overconfidence wasn't a model behavior problem. It was retrieval quality expressing itself as false certainty. Fix what goes into the context window, and the model's calibration improves without any prompt engineering.
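For reference, the false confidence metric can be scored mechanically from graded wrong answers. A sketch of the kind of check involved — this phrase list is illustrative, not the exact one we used:

```python
HEDGING_PHRASES = (
    "i'm not certain", "i couldn't find", "based on the available",
    "may not be current", "this may be outdated", "appears to",
)

def is_hedged(answer: str) -> bool:
    """True if the answer contains any hedging language."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in HEDGING_PHRASES)

def false_confidence_rate(wrong_answers: list[str]) -> float:
    """Share of wrong answers that contain no hedging at all."""
    if not wrong_answers:
        return 0.0
    return sum(not is_hedged(a) for a in wrong_answers) / len(wrong_answers)
```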
What I'd Do Differently
Start with the eval suite. Not after you get complaints, before you go live. The 4–6 hours it takes to build a 150-query eval set with reference answers will save you weeks of debugging production failures blind.
Build for the full query distribution, not the happy path. Your test suite during development is systematically biased toward queries you designed. Real users will find the edges. Measure in production with real queries before you tune anything.
The model is rarely the bottleneck. If your RAG system is hallucinating, look at retrieval before you look at the LLM. Changing the model is expensive, slow, and usually doesn't fix a retrieval quality problem. Changing the retrieval architecture is faster and the impact is larger.
Still On the List
Things we haven't tested yet on this system:
- HyDE (Hypothetical Document Embeddings) for query expansion — benchmarks look promising, mixed reports on real-world gains for knowledge-domain workloads
- Query decomposition for explicitly multi-hop questions — breaking compound queries into sub-queries before retrieval
- Dynamic eval set refresh — our static 150-query set is starting to feel unrepresentative as the knowledge base and user base evolve
Happy to go deeper on any of these in the comments, or on RAGAS configuration and the source hierarchy schema if there's interest.
Sunil Kumar is CEO of Ailoitte, an AI-native product engineering company that builds production-grade AI systems for startups, enterprises, and regulated industries. Ailoitte's AI Velocity Pods deliver fixed-price, outcome-based software development — including production RAG infrastructure with evaluation pipelines built in from day one.
Hitting a RAG quality wall? Let's talk.