The RAG System That Found Contradicting Answers (And Confidently Picked The Wrong One)

I built a RAG system for a fintech company's policy knowledge base. Customer asks about refund policy, system retrieves documentation, generates answer. Retrieval found five relevant chunks. Four said refunds within thirty days. One said refunds within fourteen days.
The AI confidently told customers they had thirty days to request refunds. The actual policy was fourteen days. Two hundred seventeen customers were given wrong information before anyone noticed.

The Setup
Fintech platform with comprehensive documentation. Policies, procedures, terms of service, FAQs, internal guidelines. Everything indexed in one knowledge base for the support AI to reference.
Standard RAG flow. Customer question triggers semantic search across documentation. Top five most relevant chunks retrieved. LLM generates answer using those chunks as context.
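The retrieval path was nothing exotic. A minimal sketch of the shape of it, not the production code; `embed` and `call_llm` stand in for whatever embedding model and LLM client you use:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def answer(question: str, index: list[Chunk], embed, call_llm, k: int = 5) -> str:
    # Rank every chunk purely by semantic similarity to the question.
    q_vec = embed(question)
    top = sorted(index, key=lambda c: cosine(q_vec, c.embedding), reverse=True)[:k]

    # Hand the top-k chunks to the LLM as context. Nothing else: no freshness,
    # no version, no authority.
    context = "\n\n".join(c.text for c in top)
    prompt = f"Answer using only this documentation:\n\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```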
I tested with forty questions. Answers matched documentation. Looked solid. Deployed.

The Conflicting Sources
Three weeks in, the compliance team flagged an issue. Customers were claiming the AI had told them refunds were available within thirty days. The current policy was fourteen days and had been for six months.
I pulled the retrieval logs for refund policy questions. Every query returned five chunks. Four chunks said thirty days. One chunk said fourteen days.
The four thirty-day chunks were old. From documentation written before the policy changed. They had not been deleted or archived. They still existed in the knowledge base, still getting retrieved, still being fed to the LLM.
The LLM saw four sources saying thirty days and one source saying fourteen days. It chose the majority. Four votes for thirty days, one vote for fourteen days. Thirty days wins.
Confidently wrong.

Why This Happened
The knowledge base contained documentation from multiple time periods. When policies changed, new documentation was added but old documentation was not removed. Historical context remained searchable.
The vector search ranked chunks by semantic similarity to the query, not by recency or accuracy. Old chunks about refund policy were just as semantically relevant as new chunks. Sometimes more relevant because older documentation was more detailed.
The retrieval system had no concept of document freshness, version history, or authoritative sources. Every chunk was treated equally. A paragraph from two years ago had the same weight as a paragraph from last month.
When contradictions appeared in retrieved chunks, the LLM had no guidance on how to handle them. It defaulted to majority voting or picked whichever chunk appeared first in the context window.

The Scope of The Problem
I audited the knowledge base. Out of eight hundred documents, one hundred thirty-four had been updated in the past year with policy or procedure changes. Only nineteen of those updates included explicit deprecation of the old versions.
That meant one hundred fifteen outdated documents were still live in the vector database, still being retrieved, still generating wrong answers. Refund policy. Pricing tiers. Feature availability. Support hours. Interest rates. All potentially contradicted by newer documentation.
Twenty-two percent of the total documentation was outdated but still active.

The Failed Fix
I tried adding timestamps to chunks and telling the LLM to prefer recent information. The prompt said: "If chunks contain conflicting information, prioritize the most recent."
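Roughly, the attempt looked like this (a sketch; it assumes each chunk carries an `updated_at` date, and the wording is paraphrased):

```python
def build_prompt_with_timestamps(question: str, top_chunks) -> str:
    # Prefix each chunk with its document date so the model can see its age.
    context = "\n\n".join(f"[updated {c.updated_at}] {c.text}" for c in top_chunks)
    return (
        "Answer using only this documentation. If chunks contain conflicting "
        "information, prioritize the most recent.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```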
That helped slightly but not enough. The LLM did not consistently identify conflicts. Sometimes it blended old and new information into hybrid answers that were partially correct and partially outdated.
Also, recency is not always the right filter. Sometimes old documentation contains important historical context or grandfathered policies that still apply to certain customers.

The Real Solution Was Source Authority
The fix required three changes to how the system handled retrieved chunks.
First, document versioning. Every document now has a version number and status flag. Active, deprecated, or archived. Deprecated and archived documents are excluded from retrieval by default unless the query specifically asks for historical information.
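In code terms, the filter is trivial once the metadata exists. A sketch, assuming each chunk now carries a `status` field:

```python
def retrieval_pool(index, include_historical: bool = False):
    # Deprecated and archived documents never reach the vector search by default.
    if include_historical:
        return index
    return [c for c in index if c.status == "active"]
```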
Second, authority ranking. Documents are tagged by source authority. Official policy documents have the highest authority. Internal guidelines have medium authority. Draft documents or old FAQs have low authority. When conflicts appear, higher authority sources win regardless of semantic similarity score or timestamp.
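One way to approximate "higher authority wins" is to rerank retrieved chunks by authority tier before similarity. A sketch, with illustrative tier names on a `source_type` field:

```python
AUTHORITY = {"official_policy": 3, "internal_guideline": 2, "draft_or_old_faq": 1}

def rerank(scored_chunks):
    # scored_chunks: list of (chunk, similarity) pairs from the vector search.
    # Authority tier decides first; similarity only breaks ties within a tier.
    return sorted(
        scored_chunks,
        key=lambda cs: (AUTHORITY.get(cs[0].source_type, 0), cs[1]),
        reverse=True,
    )
```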
Third, conflict detection in the generation prompt. The LLM is explicitly instructed: "Check if retrieved chunks contradict each other. If they do, identify which source has the most recent timestamp and highest authority. Use only that source for your answer. If you cannot resolve the conflict, state that policies may have changed and escalate to human support."
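The generation side, sketched with those resolution rules baked into the prompt and the metadata surfaced next to each chunk (field names are illustrative):

```python
CONFLICT_RULES = (
    "Check if retrieved chunks contradict each other. If they do, identify which "
    "source has the most recent timestamp and highest authority. Use only that "
    "source for your answer. If you cannot resolve the conflict, state that "
    "policies may have changed and escalate to human support."
)

def build_conflict_aware_prompt(question, ranked_chunks):
    context = "\n\n".join(
        f"[{c.source_type} | v{c.version} | updated {c.updated_at}] {c.text}"
        for c in ranked_chunks
    )
    return f"{CONFLICT_RULES}\n\nDocumentation:\n{context}\n\nQuestion: {question}"
```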

What Changed
Question: "What is your refund policy?"
Old behavior:
- Retrieved five chunks: four outdated, one current
- Generated an answer based on the majority: thirty days
- Wrong information given to the customer

New behavior (a composed sketch follows the list):
- Retrieved five chunks
- System filtered: only active documents included
- If old chunks still appeared due to the search algorithm, conflict detection triggered
- LLM identified the contradiction, checked authority and timestamp
- Used only the current, active policy document
- Correct answer: fourteen days
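Stitched together from the sketches above, the new path looks roughly like this:

```python
def answer_v2(question, index, embed, call_llm, k: int = 5):
    # 1. Versioning filter: deprecated and archived documents never reach search.
    pool = retrieval_pool(index)

    # 2. Similarity search over the active pool only.
    q_vec = embed(question)
    scored = sorted(
        ((c, cosine(q_vec, c.embedding)) for c in pool),
        key=lambda cs: cs[1],
        reverse=True,
    )[:k]

    # 3. Authority-aware rerank, then a conflict-aware prompt.
    top = [c for c, _ in rerank(scored)]
    return call_llm(build_conflict_aware_prompt(question, top))
```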

The Results
Before the fix, twenty-two percent of documentation was outdated and active. Conflicting information appeared in thirty-seven percent of complex queries. Wrong answers were given to customers two hundred seventeen times before detection.
After the fix, outdated documents were excluded from search. The conflicting information rate dropped to four percent, mostly edge cases with legitimately different policies for different customer tiers. Wrong answers dropped to near zero.
The business impact was significant. Compliance risk eliminated. Customer disputes over policy misinformation stopped. Support team regained trust in the AI. Legal department approved continued use of the system.

What I Learned
Knowledge bases are not static. Documentation accumulates over time. Old information does not automatically disappear when new information is added. Retrieval systems without versioning or deprecation treat all content equally regardless of accuracy.
Semantic similarity is not the same as correctness. An outdated document can be highly relevant to a query while being factually wrong. Retrieval must filter by authority and currency, not just relevance.
LLMs will not automatically detect or resolve contradictions in retrieved context. They will synthesize, blend, or pick based on unclear heuristics unless explicitly prompted to identify conflicts and apply resolution rules.

The Bottom Line
A RAG system retrieving from unversioned documentation generated wrong answers by confidently choosing outdated information when retrieved chunks conflicted. The fix was document versioning, authority ranking, and explicit conflict detection in the generation prompt.

Written by Farhan Habib Faraz
Senior Prompt Engineer building conversational AI and voice agents

Tags: rag, documentversioning, contradictions, knowledgebase, retrieval, policyaccuracy
