From 62% to 94% RAG Accuracy: The 5 Architecture Changes That Actually Moved the Needle

We measured baseline accuracy across a production RAG system: 62%.

Six weeks later, after five architecture changes and zero model changes: 94%.

Here's exactly what we changed, why each one mattered, and what the numbers looked like before and after each step.

The Setup

Internal knowledge assistant for a mid-market company. Knowledge spread across Confluence, Google Drive, and SharePoint, approximately 4,200 documents total. Users asking natural language questions about internal policies, processes, and product specifications.

Stack at baseline:

Component        Details
LLM              GPT-4o
Embedding model  text-embedding-3-large
Vector store     Pinecone
Chunking         RecursiveCharacterTextSplitter, 1024 tokens, 20% overlap
Retrieval        top-8 by cosine similarity, no re-ranking
Eval             none

This worked well in testing. In production, it hit a ceiling at week three — when real users arrived with queries that went beyond the clean, structured examples we'd tested against.

The first sign was a CTO call. A VP had used the system for a cross-departmental policy question, gotten a confident and wrong answer, and forwarded it to three people. The system had retrieved from one policy document and missed a contradicting clause in another.

Before touching anything, we measured.

Evaluation First — Before Any Changes

150 real queries pulled from production logs: not test cases we designed, but actual user queries from the support logs, stratified across query types.

Reference answers written by domain experts. Automated grading via RAGAS across four dimensions: faithfulness, answer relevance, context precision, context recall. Manual spot-check on the ~15% of answers that fell in scoring edge cases.

Baseline results:

Metric                  Score
Overall accuracy        62%
Multi-document queries  41%
Exact-match retrieval   58%
False confidence rate*  68%

*Proportion of wrong answers with no hedging language — the model sounded equally certain whether it was right or wrong.

Without this baseline, none of the following changes would have been verifiable. Build the eval suite first. Always.

Change 1: Semantic Chunking

Impact: 🔴 High

Fixed-window chunking at 1024 tokens truncates documents at arbitrary boundaries. For policy documents, contracts, and multi-page specs, this severs logical relationships between sections: a conditional clause in section 3 that's resolved by a definition in section 7 ends up in two different chunks with no retrieval path connecting them.

We moved to semantic chunking: split on natural topic/section boundaries detected via sentence-level embedding similarity. Consecutive sentences are grouped while their cosine similarity stays above a threshold; a new chunk starts when similarity drops below it.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_chunk(sentences, embed_fn, threshold=0.75):
    """
    Split sentences into semantically coherent chunks.

    Args:
        sentences: list of sentence strings
        embed_fn: function that returns an embedding for a string
        threshold: minimum similarity to stay in the same chunk

    Returns:
        list of chunk strings
    """
    if not sentences:
        return []

    # Embed each sentence exactly once up front; embedding calls are the
    # expensive part, so don't re-embed inside the loop.
    embeddings = [np.array(embed_fn(s)).reshape(1, -1) for s in sentences]

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])[0][0]

        if similarity >= threshold:
            current_chunk.append(sentences[i])
        else:
            # Similarity dropped below the threshold: close this chunk
            # and start a new one at the current sentence.
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    chunks.append(" ".join(current_chunk))
    return chunks

If you're using LlamaIndex: SemanticSplitterNodeParser with breakpoint_percentile_threshold=85 is a solid starting point. Tune the threshold per document type — policy and legal docs generally benefit from higher thresholds (split less often).

Tradeoffs to expect:

  • Preprocessing is significantly heavier than fixed-window
  • Variable chunk lengths complicate token budget management in your context window
  • Ingestion pipeline needs updating if you have document update automation

For read-heavy internal assistants with low update frequency, the accuracy gain outweighs these costs.

Accuracy improvement on multi-hop queries after this change alone: +31% relative.

Change 2: Hybrid Search with Reciprocal Rank Fusion

Impact: 🔴 High

Semantic/vector search is excellent at concept-level similarity. It's weak at exact matching — product codes, regulation identifiers, version strings, internal project names, anything where a user is looking for a specific term rather than a concept.

We added a BM25 keyword index running in parallel with the Pinecone vector index, retrieved the top-20 candidates from each independently, then fused the two ranked lists with Reciprocal Rank Fusion (RRF):

def reciprocal_rank_fusion(
    vector_results: list[str],
    bm25_results: list[str],
    k: int = 60
) -> list[str]:
    """
    Merge two ranked lists using Reciprocal Rank Fusion.

    Args:
        vector_results: doc IDs ranked by vector similarity
        bm25_results: doc IDs ranked by BM25 keyword score
        k: smoothing constant (60 is standard, rarely needs tuning)

    Returns:
        Merged list of doc IDs sorted by fused score (best first)
    """
    scores: dict[str, float] = {}

    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    return sorted(scores.keys(), key=lambda x: scores[x], reverse=True)

If you're using LangChain: EnsembleRetriever handles this cleanly. Set weights=[0.4, 0.6] for BM25 and vector respectively as a starting point, and adjust based on how many exact-match queries your workload has. The weights matter less than you'd think if you're re-ranking afterward.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 20

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)

If your knowledge base contains structured identifiers (and it almost certainly does), vector-only retrieval is not sufficient. This was the change that fixed the most visible failure class in our case.


Change 3: Cross-Encoder Re-ranking

Impact: 🟡 Medium-High

Vector retrieval and BM25 give you candidate chunks. They rank candidates by how similar they are to the query embedding or how well they match keywords. Neither metric is the same as: how useful is this specific chunk for answering this specific question?

A cross-encoder re-ranker scores each chunk-query pair jointly; it reads both together, not independently, which produces relevance scores much more aligned with what the LLM actually needs.

We used ms-marco-MiniLM-L-6-v2 from sentence-transformers: small enough to run without meaningful infrastructure cost, effective enough to make a real difference.

from sentence_transformers import CrossEncoder
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[Chunk], top_k: int = 5) -> list[Chunk]:
    """
    Re-rank candidate chunks by relevance to the query.

    Args:
        query: the user's question
        candidates: retrieved chunks from vector + BM25 retrieval
        top_k: how many to pass to the LLM context window

    Returns:
        Top-k chunks ranked by cross-encoder relevance score
    """
    pairs = [(query, chunk.text) for chunk in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

If you're using LangChain:

from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=model, top_n=5)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever  # from Change 2
)

Latency cost: ~200–400ms per query on CPU inference with MiniLM. For an internal assistant where users expect 3–4s response times, this is acceptable. For customer-facing chatbots with sub-1s SLAs, benchmark first and consider skipping re-ranking on queries classified as simple (single-document, low complexity).

Change 4: Source Hierarchy + Metadata Tagging

Impact: 🟡 Medium

When Confluence has a "remote work policy" document updated in March, and SharePoint has a different version updated in January, and both are in your knowledge base — which one does the retriever trust?

Without source hierarchy metadata, the LLM gets both chunks, infers authority from the text, and guesses. It guesses wrong often enough to matter.

We added three metadata fields to every document on ingestion:

from langchain.schema import Document
from datetime import datetime

def create_document_with_metadata(
    text: str,
    source: str,
    source_authority: int,  # 1=primary, 2=secondary, 3=supplementary
    last_updated: datetime,
    domain: str
) -> Document:
    """
    Wrap chunk text with authority metadata for conflict resolution.

    source_authority levels:
        1 = Primary (HR portal, official policy docs, product spec sheets)
        2 = Secondary (team wikis, internal guides, how-to docs)
        3 = Supplementary (Slack archives, meeting notes, draft docs)
    """
    return Document(
        page_content=text,
        metadata={
            "source": source,
            "source_authority": source_authority,
            "last_updated": last_updated.isoformat(),
            "domain": domain
        }
    )

# Retrieval with authority filter — prefer primary sources
vector_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 20,
        "filter": {"source_authority": {"$lte": 2}}  # primary + secondary only
    }
)

The heuristic is simple: when two sources conflict on the same topic, source_authority=1 wins. This eliminates the most embarrassing failure class — confident wrong answers sourced from an outdated secondary document when the primary existed and said something different.

This change required no model updates and no prompt changes. It's a data pipeline discipline issue, not an AI problem.

Change 5: Structured Evaluation Suite (The One Most Teams Skip)

Impact: 🔴 Foundational

This is the process change, not the architecture change. And it's the one that made every other change above verifiable rather than speculative.

What we built:

  • 150 representative queries sampled from production logs, not cherry-picked happy-path examples, not queries we designed. Stratified across query types: single-document, multi-document, exact-match, comparative, conditional.
  • Reference answers written by domain experts for each query, the ground truth the system should produce.
  • Automated RAGAS grading across faithfulness, answer relevance, context precision, and context recall.
  • Manual spot-check on ~15% of borderline-scored answers.
  • CI gate: eval suite runs on every deployment. A drop of >3% on any category blocks the deployment (a sketch of the gate follows the grading code below).

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

def run_eval(questions, answers, contexts, ground_truths):
    """
    Run RAGAS evaluation against reference answers.

    Returns the RAGAS evaluation result with per-metric scores.
    """
    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,       # list of lists — retrieved chunks per question
        "ground_truth": ground_truths
    }

    dataset = Dataset.from_dict(data)

    result = evaluate(
        dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall
        ]
    )

    return result

Without this eval pipeline, you're flying blind. Developers eyeball outputs and call it "seems good." Users report only the failures loud enough to surface, which is a heavily biased sample of what's actually failing. The 38% that fails quietly never gets fixed.

Final Results

All five changes applied, same LLM throughout:

Metric                    Baseline  After  Delta
Overall accuracy          62%       94%    +32pp
Multi-document reasoning  41%       87%    +46pp
Exact-match retrieval     58%       96%    +38pp
False confidence rate     68%       12%    -56pp

The false confidence number is the one I find most significant. 68% of wrong answers originally had zero hedging language — the model sounded equally certain whether it was right or wrong. After fixing the retrieval layer (without touching the model or prompt), that rate dropped to 12%.

Overconfidence wasn't a model behavior problem. It was retrieval quality expressing itself as false certainty. Fix what goes into the context window, and the model's calibration improves without any prompt engineering.

What I'd Do Differently

Start with the eval suite. Not after you get complaints, before you go live. The 4–6 hours of building a 150-query eval set with reference answers will save you weeks of debugging production failures blind.

Build for the full query distribution, not the happy path. Your test suite during development is systematically biased toward queries you designed. Real users will find the edges. Measure in production with real queries before you tune anything.

The model is rarely the bottleneck. If your RAG system is hallucinating, look at retrieval before you look at the LLM. Changing the model is expensive, slow, and usually doesn't fix a retrieval quality problem. Changing the retrieval architecture is faster and the impact is larger.

Still On the List

Things we haven't tested yet on this system:

  • HyDE (Hypothetical Document Embeddings) for query expansion — benchmarks look promising, mixed reports on real-world gains for knowledge-domain workloads
  • Query decomposition for explicitly multi-hop questions — breaking compound queries into sub-queries before retrieval (rough shape sketched after this list)
  • Dynamic eval set refresh — our static 150-query set is starting to feel unrepresentative as the knowledge base and user base evolve

Happy to go deeper on any of these in the comments, or on RAGAS configuration and the source hierarchy schema if there's interest.

Sunil Kumar is CEO of Ailoitte, an AI-native product engineering company that builds production-grade AI systems for startups, enterprises, and regulated industries. Ailoitte's AI Velocity Pods deliver fixed-price, outcome-based software development — including production RAG infrastructure with evaluation pipelines built in from day one.

Hitting a RAG quality wall? Let's talk.
