<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anthony Jiang</title>
    <description>The latest articles on DEV Community by Anthony Jiang (@anthonyincanada).</description>
    <link>https://dev.to/anthonyincanada</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3966773%2Ffe6679a8-e248-42ea-b76e-0f1b0c30ad92.png</url>
      <title>DEV Community: Anthony Jiang</title>
      <link>https://dev.to/anthonyincanada</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anthonyincanada"/>
    <language>en</language>
    <item>
      <title>Enterprise RAG’s Biggest Risk: Answers That Look Correct but Aren’t</title>
      <dc:creator>Anthony Jiang</dc:creator>
      <pubDate>Wed, 03 Jun 2026 15:00:54 +0000</pubDate>
      <link>https://dev.to/anthonyincanada/-enterprise-rags-biggest-risk-answers-that-look-correct-but-arent-n84</link>
      <guid>https://dev.to/anthonyincanada/-enterprise-rags-biggest-risk-answers-that-look-correct-but-arent-n84</guid>
      <description>&lt;p&gt;Most RAG demos feel impressive at first.&lt;/p&gt;

&lt;p&gt;You upload documents, ask a question, and the system returns a fluent answer with citations. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What was Tesla’s automotive revenue in 2023?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system retrieves a passage from the annual report, generates an answer, and attaches a source.&lt;/p&gt;

&lt;p&gt;At that point, it is tempting to think the system is close to usable.&lt;/p&gt;

&lt;p&gt;But after building my own renewable energy industry RAG agent, I realized that answering is only the first layer. The harder problem is proving that the answer is actually reliable.&lt;/p&gt;

&lt;p&gt;In enterprise documents, many failures are not obvious hallucinations. They are “almost correct” answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the number exists, but comes from the wrong financial scope&lt;/li&gt;
&lt;li&gt;the answer is correct, but the citation does not really support it&lt;/li&gt;
&lt;li&gt;the PDF table is parsed, but the metric and value are misaligned&lt;/li&gt;
&lt;li&gt;the right evidence is retrieved, but ranked too low&lt;/li&gt;
&lt;li&gt;fixing one bad case causes other cases to regress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I added an Experience + Repair Pipeline on top of my original RAG system.&lt;/p&gt;

&lt;p&gt;The goal was not to make the model sound better. The goal was to make failures traceable, repairs testable, and quality improvement repeatable.&lt;/p&gt;

&lt;h2&gt;The Original Project&lt;/h2&gt;

&lt;p&gt;The first version of my project was a renewable energy industry research agent.&lt;/p&gt;

&lt;p&gt;It retrieved evidence from annual reports, investor materials, and public documents, then answered research questions about electric vehicles, batteries, energy storage, and solar companies.&lt;/p&gt;

&lt;p&gt;Example questions included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was Tesla’s automotive revenue in 2023?&lt;/li&gt;
&lt;li&gt;What financial performance did First Solar disclose in its annual report?&lt;/li&gt;
&lt;li&gt;What capacity expansion plans did LONGi Green Energy mention?&lt;/li&gt;
&lt;li&gt;What was CATL’s global energy storage battery market share?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system already supported document ingestion, chunking, embeddings, hybrid retrieval, reranking, evidence citation, structured financial metric extraction, frontend workflows, and Docker deployment.&lt;/p&gt;

&lt;p&gt;But once the first version was working, a different kind of problem started to appear.&lt;/p&gt;

&lt;p&gt;A RAG system can answer a question and still be wrong.&lt;/p&gt;

&lt;h2&gt;A Real Failure: The Number Was Real, but the Scope Was Wrong&lt;/h2&gt;

&lt;p&gt;One bad case looked like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What was CATL’s net cash flow from operating activities in 2023?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system answered:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;9,282,612.44 ten thousand RMB&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This was not a hallucinated number. It really appeared in CATL’s 2023 annual report, and it was related to net cash flow from operating activities. The unit is RMB 10,000, so the value is roughly RMB 92.8 billion.&lt;/p&gt;

&lt;p&gt;At first glance, it looked correct.&lt;/p&gt;

&lt;p&gt;But the gold evidence in my evaluation set pointed to a different table: the parent company cash flow statement.&lt;/p&gt;

&lt;p&gt;The correct answer for that evidence was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;5,647,457.04 ten thousand RMB&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the system did not invent a number. It found a real number, but from the wrong financial scope.&lt;/p&gt;

&lt;p&gt;That kind of failure is dangerous because it looks professional. The answer is not obviously wrong. You only notice the issue when you check the exact table, scope, row, and citation.&lt;/p&gt;

&lt;h2&gt;Another Failure: The Answer Was Correct, but the Citation Was Wrong&lt;/h2&gt;

&lt;p&gt;Another case was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many PhD and master’s degree researchers did CATL have?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer was correct:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;361 PhD researchers and 3,913 master’s degree researchers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But the citation pointed to similar chunks instead of the actual gold evidence.&lt;/p&gt;

&lt;p&gt;Those chunks also contained words like “PhD”, “master’s”, and “R&amp;amp;D staff”. Some of them even had related numbers. But they were not the most direct supporting evidence for this answer.&lt;/p&gt;

&lt;p&gt;In enterprise RAG, citation quality matters.&lt;/p&gt;

&lt;p&gt;Users do not only need an answer. They need to know where the answer came from.&lt;/p&gt;

&lt;p&gt;That is why I started treating citation accuracy as a separate metric, not just a nice-to-have.&lt;/p&gt;
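
&lt;p&gt;A minimal sketch of scoring the two metrics separately. The &lt;code&gt;CaseResult&lt;/code&gt; record and its field names are hypothetical, not the project’s actual schema. The point is only that a case can pass on the answer and still fail on the citation.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One evaluated QA case (hypothetical schema for illustration)."""
    answer_correct: bool    # predicted answer matches the gold answer
    citation_correct: bool  # cited chunk matches the gold evidence


def score(results: list[CaseResult]) -> dict[str, float]:
    """Report answer and citation accuracy as two independent numbers."""
    total = len(results)
    if total == 0:
        return {"answer_accuracy": 0.0, "citation_accuracy": 0.0}
    return {
        "answer_accuracy": sum(r.answer_correct for r in results) / total,
        "citation_accuracy": sum(r.citation_correct for r in results) / total,
    }


# The researcher headcount case: right answer, wrong citation.
results = [CaseResult(answer_correct=True, citation_correct=False)]
metrics = score(results)
```

&lt;p&gt;With a single combined pass/fail flag, this case would simply count as a failure and the reason would be lost.&lt;/p&gt;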

&lt;h2&gt;Making Failed Cases Explain Themselves&lt;/h2&gt;

&lt;p&gt;When a RAG system fails, the usual workflow is very manual.&lt;/p&gt;

&lt;p&gt;You open the bad case, inspect the retrieved chunks, guess what went wrong, change something, and run a few questions again.&lt;/p&gt;

&lt;p&gt;That can work for a few examples. It does not scale.&lt;/p&gt;

&lt;p&gt;So I added an Experience layer.&lt;/p&gt;

&lt;p&gt;In this project, an experience is not just a log.&lt;/p&gt;

&lt;p&gt;A log tells me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What was the input?
What was the output?
Did anything crash?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An experience should tell me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Why did this case fail?
Did the failure happen in retrieval, reranking, answer extraction, citation, or judging?
Was the problem caused by PDF table parsing?
Was it caused by financial statement scope?
Was the answer correct but the citation wrong?
Can this be repaired automatically?
Which cases should be rerun to verify the repair?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each failed case is converted into structured information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root_cause
root_cause_detail
diagnostics
trace
repair_iteration_hint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some failure types I saw repeatedly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table_line_break
duplicate_equivalent_chunk
market_share
gold_rank_4_or_5
scope_conflict
neighbor_value_selected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The diagnostics also store details such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gold_rank
answer_numbers
question_terms
evidence_candidates
citation_repair
structured_fact
missing_schema_fields
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This turns a failed case from an isolated bug into something the system can reason about.&lt;/p&gt;
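
&lt;p&gt;As a rough illustration, an experience record could be modeled like this. The five top-level field names come from the list above; &lt;code&gt;case_id&lt;/code&gt;, the types, and the example values are my assumptions for the sketch, not the project’s real schema.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Experience:
    """Structured record for one failed case."""
    case_id: str                      # hypothetical identifier, assumed for this sketch
    root_cause: str                   # e.g. "scope_conflict", "table_line_break"
    root_cause_detail: str            # free-text explanation of the failure
    diagnostics: dict[str, Any] = field(default_factory=dict)
    trace: list[str] = field(default_factory=list)  # pipeline stages visited
    repair_iteration_hint: str = ""   # which repair candidate to try next


# Example values are illustrative, built from the operating cash flow case.
exp = Experience(
    case_id="catl-operating-cash-flow-2023",
    root_cause="scope_conflict",
    root_cause_detail="Value came from a table whose statement scope differs from the gold evidence.",
    diagnostics={
        "answer_numbers": ["9,282,612.44"],
        "question_terms": ["net cash flow from operating activities", "2023"],
    },
    trace=["retrieval", "reranking", "answer_extraction", "citation"],
    repair_iteration_hint="table_row_stitch",
)
```

&lt;p&gt;Because the record is plain structured data, failures can be grouped by &lt;code&gt;root_cause&lt;/code&gt; and routed to a matching repair candidate instead of being read one by one.&lt;/p&gt;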

&lt;h2&gt;Repair Pipeline: Stop Fixing RAG by Guesswork&lt;/h2&gt;

&lt;p&gt;Once failed cases had structured explanations, I built the next step: repair candidates.&lt;/p&gt;

&lt;p&gt;A repair candidate is not an immediate code change. It is a proposed fix that must be tested.&lt;/p&gt;

&lt;p&gt;The loop looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval set
→ failed cases
→ experience
→ root cause and diagnostics
→ repair plan
→ repair candidates
→ targeted regression
→ acceptance gate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repair candidates I used included:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;market_share_binding
table_row_stitch
equivalent_citation_group
rerank_exact_phrase_hint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Market Share Binding&lt;/h3&gt;

&lt;p&gt;Market share questions often contain many percentages in one paragraph.&lt;/p&gt;

&lt;p&gt;A report may mention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EV battery market share&lt;/li&gt;
&lt;li&gt;energy storage battery market share&lt;/li&gt;
&lt;li&gt;global market share&lt;/li&gt;
&lt;li&gt;domestic market share&lt;/li&gt;
&lt;li&gt;year-over-year growth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system cannot simply pick a nearby percentage.&lt;/p&gt;

&lt;p&gt;For this case, the repair candidate binds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;subject + market + metric + value + citation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the value must match the requested subject and market context, not just any percentage in the same chunk.&lt;/p&gt;
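
&lt;p&gt;One way to sketch the binding idea, assuming a simple sentence-level rule: a percentage is accepted only when the same sentence also names the requested subject and market. The function name &lt;code&gt;bind_market_share&lt;/code&gt; is hypothetical, and the passage uses a made-up company with placeholder numbers.&lt;/p&gt;

```python
import re
from typing import Optional

PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def bind_market_share(text: str, subject: str, market: str) -> Optional[str]:
    """Return a percentage only if its sentence also names the subject and market."""
    for sentence in re.split(r"[.;]\s+", text):
        lowered = sentence.lower()
        if subject.lower() in lowered and market.lower() in lowered:
            match = PERCENT.search(sentence)
            if match:
                return match.group(0)
    return None  # no bound value: abstain instead of picking a nearby number


# Made-up passage with placeholder numbers, not taken from any report.
passage = (
    "Acme's EV battery market share was 11.1%; "
    "Acme's energy storage battery market share was 22.2%."
)
value = bind_market_share(passage, "Acme", "energy storage battery")
```

&lt;p&gt;This sketch takes the first percentage in a matching sentence, so a sentence that mixes a share with a growth rate would still need finer binding on the metric itself.&lt;/p&gt;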

&lt;h3&gt;Table Row Stitching&lt;/h3&gt;

&lt;p&gt;PDF tables often break rows apart.&lt;/p&gt;

&lt;p&gt;A human can understand this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Net cash flow from operating activities
5,647,457.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But a parser may treat the metric and the value as separate lines.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;table_row_stitch&lt;/code&gt; candidate tries to rebuild the row by binding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table title
metric line
value line
period
unit
citation
statement scope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helped fix the CATL operating cash flow case.&lt;/p&gt;
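
&lt;p&gt;A simplified sketch of the stitching step, covering only the metric line and value line. It assumes the parser emits one string per line and that a value line contains nothing but a number. The real candidate also binds the table title, period, unit, citation, and statement scope, which are left out here.&lt;/p&gt;

```python
import re

# A line that is only a number, with optional sign, thousands separators, decimals.
NUMERIC_LINE = re.compile(r"^-?[\d,]+(?:\.\d+)?$")


def stitch_rows(lines: list[str]) -> list[tuple[str, str]]:
    """Pair each metric line with the numeric line that directly follows it."""
    rows = []
    pending_metric = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if NUMERIC_LINE.match(line):
            if pending_metric is not None:
                rows.append((pending_metric, line))
                pending_metric = None
        else:
            pending_metric = line  # a text line starts a new candidate row
    return rows


parsed = ["Net cash flow from operating activities", "5,647,457.04"]
rows = stitch_rows(parsed)
```

&lt;p&gt;A bare year such as &lt;code&gt;2023&lt;/code&gt; also looks numeric to this rule, which is one reason the period and table title need to be bound explicitly rather than inferred.&lt;/p&gt;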

&lt;h3&gt;Equivalent Citation Groups&lt;/h3&gt;

&lt;p&gt;Sometimes the answer is correct, but the cited chunk is not the exact gold chunk.&lt;/p&gt;

&lt;p&gt;This can happen because of chunk overlap, duplicated tables, or repeated evidence across nearby pages.&lt;/p&gt;

&lt;p&gt;So I added &lt;code&gt;equivalent_citation_group&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If a cited chunk and the gold chunk share the key question terms and the key answer values, they can be treated as equivalent supporting evidence.&lt;/p&gt;

&lt;p&gt;For the CATL R&amp;amp;D staff case, chunks containing both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PhD: 361
Master’s: 3,913
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;could be grouped as equivalent evidence for the same answer.&lt;/p&gt;
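
&lt;p&gt;The equivalence rule can be sketched as a plain containment check. This is an assumption about how such a check might look, not the project’s actual matching logic, and the chunk texts below are invented for the example.&lt;/p&gt;

```python
def is_equivalent_evidence(
    chunk: str, gold: str, question_terms: list[str], answer_values: list[str]
) -> bool:
    """True if both chunks contain every key question term and answer value."""
    required = [item.lower() for item in question_terms + answer_values]

    def covers(text: str) -> bool:
        lowered = text.lower()
        return all(item in lowered for item in required)

    return covers(chunk) and covers(gold)


# Invented chunk texts for illustration.
gold_chunk = "Research staff by degree. PhD: 361. Master's: 3,913."
overlap_chunk = "PhD: 361. Master's: 3,913. (continued from previous page)"
near_miss = "The company employs PhD and Master's researchers across its labs."

terms = ["PhD", "Master"]
values = ["361", "3,913"]
same_group = is_equivalent_evidence(overlap_chunk, gold_chunk, terms, values)
different = is_equivalent_evidence(near_miss, gold_chunk, terms, values)
```

&lt;p&gt;Here &lt;code&gt;same_group&lt;/code&gt; is true and &lt;code&gt;different&lt;/code&gt; is false, because the near-miss chunk has the right words but not the answer values. Substring matching is deliberately naive: &lt;code&gt;361&lt;/code&gt; would also match inside a longer number, so a stricter version should compare whole tokens.&lt;/p&gt;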

&lt;h3&gt;Rerank Phrase Hints&lt;/h3&gt;

&lt;p&gt;Sometimes the right evidence is retrieved, but ranked fourth or fifth.&lt;/p&gt;

&lt;p&gt;In that case, the model may cite a similar chunk ranked above it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rerank_exact_phrase_hint&lt;/code&gt; uses question terms, answer values, and gold-like phrases to help promote the more relevant evidence.&lt;/p&gt;
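
&lt;p&gt;A hedged sketch of the idea as an additive score boost. The boost size, the function name &lt;code&gt;rerank_with_hints&lt;/code&gt;, and the candidate scores are all placeholders. The project’s reranker may combine the hints differently.&lt;/p&gt;

```python
def rerank_with_hints(
    candidates: list[tuple[str, float]], phrases: list[str], boost: float = 0.1
) -> list[tuple[str, float]]:
    """Add a fixed boost per exact phrase hit, then sort by the adjusted score."""
    adjusted = []
    for text, score in candidates:
        hits = sum(1 for phrase in phrases if phrase.lower() in text.lower())
        adjusted.append((text, score + boost * hits))
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)


# Placeholder chunks and scores, invented for illustration.
candidates = [
    ("Overview of research staff and lab locations", 0.80),
    ("PhD: 361. Master's: 3,913.", 0.70),
]
ranked = rerank_with_hints(candidates, ["PhD", "361", "3,913"])
```

&lt;p&gt;The chunk with the exact values moves to the top even though its original score was lower. Answer values are only known during evaluation, so at query time the hint list would have to be limited to question terms.&lt;/p&gt;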

&lt;h2&gt;The Acceptance Gate&lt;/h2&gt;

&lt;p&gt;Repair candidates are not accepted just because they fix one example.&lt;/p&gt;

&lt;p&gt;They must pass targeted regression.&lt;/p&gt;

&lt;p&gt;The acceptance gate checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Did the number of failed cases decrease?
Did answer accuracy stay the same or improve?
Did citation accuracy stay the same or improve?
Did hallucination rate stay the same or decrease?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This caught a real regression.&lt;/p&gt;

&lt;p&gt;At one point, a new set of repair candidates fixed the final two bad cases, but caused three previously fixed cases to fail again.&lt;/p&gt;

&lt;p&gt;If I had only looked at the last two cases, I would have thought the repair worked.&lt;/p&gt;

&lt;p&gt;The acceptance gate rejected it.&lt;/p&gt;

&lt;p&gt;After that, I added candidate merging. Historical accepted candidates are merged with new candidates before running regression again.&lt;/p&gt;

&lt;p&gt;With candidate merging, all five bad cases passed.&lt;/p&gt;
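
&lt;p&gt;The gate and the merge step can be sketched together. The four checks mirror the questions listed above; the &lt;code&gt;Metrics&lt;/code&gt; class, the function names, and the way candidates are split into accepted and new are assumptions for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    failed_cases: int
    answer_accuracy: float
    citation_accuracy: float
    hallucination_rate: float


def passes_gate(before: Metrics, after: Metrics) -> bool:
    """Accept only if failures drop and no other metric gets worse."""
    return (
        before.failed_cases > after.failed_cases
        and after.answer_accuracy >= before.answer_accuracy
        and after.citation_accuracy >= before.citation_accuracy
        and before.hallucination_rate >= after.hallucination_rate
    )


def merge_candidates(accepted: list[str], new: list[str]) -> list[str]:
    """Keep previously accepted candidates active, then append new ones once."""
    merged = list(accepted)
    for name in new:
        if name not in merged:
            merged.append(name)
    return merged


first_round = Metrics(failed_cases=2, answer_accuracy=0.8, citation_accuracy=0.6, hallucination_rate=0.0)
final_round = Metrics(failed_cases=0, answer_accuracy=1.0, citation_accuracy=1.0, hallucination_rate=0.0)
# Regressive run: two cases fixed, three broken again. Only the failure count
# follows from the text; the accuracy values here are placeholders.
regressive = Metrics(failed_cases=3, answer_accuracy=0.8, citation_accuracy=0.4, hallucination_rate=0.0)

rejected = not passes_gate(first_round, regressive)
accepted = passes_gate(first_round, final_round)
# Arbitrary split between old and new candidates, for illustration only.
active = merge_candidates(
    ["market_share_binding", "table_row_stitch"],
    ["equivalent_citation_group", "table_row_stitch"],
)
```

&lt;p&gt;Running regression against the merged list is what keeps an earlier fix from silently dropping out when a new candidate is tested.&lt;/p&gt;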

&lt;h2&gt;Results from One Iteration&lt;/h2&gt;

&lt;p&gt;In this round, I evaluated 57 QA cases.&lt;/p&gt;

&lt;p&gt;After several iterations, 5 representative bad cases remained. They covered market share extraction, PDF table line breaks, financial statement scope, citation binding, and reranking.&lt;/p&gt;

&lt;p&gt;Before repair:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5 targeted bad cases
5 failed
answer accuracy: 60%
citation accuracy: 0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the first repair candidate application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5 targeted bad cases
2 failed
answer accuracy: 80%
citation accuracy: 60%
hallucination rate: 0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After fixing the remaining two and merging historical candidates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5 targeted bad cases
0 failed
answer accuracy: 100%
citation accuracy: 100%
hallucination rate: 0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This does not mean the entire RAG system is now 100% correct.&lt;/p&gt;

&lt;p&gt;It means that, for this set of real bad cases, the pipeline was able to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;detect failures
explain failures
generate repair candidates
run targeted regression
reject regressive repairs
merge accepted candidates
continue improving
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is much more reliable than asking a few questions manually and deciding that the system “feels better”.&lt;/p&gt;

&lt;h2&gt;Why Scheduled Jobs and CI/CD Make This More Useful&lt;/h2&gt;

&lt;p&gt;Right now, I run this pipeline manually.&lt;/p&gt;

&lt;p&gt;The more useful direction is to connect it with scheduled jobs and CI/CD.&lt;/p&gt;

&lt;p&gt;Enterprise RAG systems are not static. Quality can change whenever:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new documents are added
chunks are rebuilt
embedding models change
reranking strategies change
prompts are updated
PDF parsers are changed
DataJuicer cleaning rules change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each change may silently break something.&lt;/p&gt;

&lt;p&gt;A new chunking strategy may improve recall but hurt citation accuracy.&lt;br&gt;&lt;br&gt;
A prompt update may make answers more fluent but less grounded.&lt;br&gt;&lt;br&gt;
A PDF parser fix may solve one table but misalign another.&lt;br&gt;&lt;br&gt;
A reranker change may promote the right evidence for one query but demote it for another.&lt;/p&gt;

&lt;p&gt;If the Experience + Repair Pipeline is wired into scheduled jobs or CI/CD, it can automatically run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;evaluation
failure experience generation
repair candidate generation
targeted regression
quality gate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a RAG engineer, this means less repetitive spot checking.&lt;/p&gt;

&lt;p&gt;Instead of repeatedly asking a few questions after every change, the system can report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;which cases failed
why they failed
what repair candidates were generated
whether the repair improved quality
whether it caused regressions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
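
&lt;p&gt;In a CI job or scheduled run, the quality gate usually comes down to an exit code. A minimal sketch, assuming the evaluation step produces a dictionary of metrics. The function name and the threshold values are placeholders, not settings from my project.&lt;/p&gt;

```python
def gate_exit_code(report: dict[str, float], minimums: dict[str, float]) -> int:
    """Return 0 if every metric meets its minimum, else 1 (a failing CI step)."""
    failures = [
        name for name, floor in minimums.items()
        if floor > report.get(name, 0.0)  # a missing metric counts as failing
    ]
    for name in failures:
        print(f"quality gate failed: {name}={report.get(name, 0.0)} is below {minimums[name]}")
    return 1 if failures else 0


# Placeholder thresholds; a real job would load the report from the eval run.
minimums = {"answer_accuracy": 0.9, "citation_accuracy": 0.9}
exit_code = gate_exit_code({"answer_accuracy": 1.0, "citation_accuracy": 0.6}, minimums)
# In a CI job or cron wrapper: raise SystemExit(exit_code)
```

&lt;p&gt;A nonzero exit code is enough for most CI systems to block a merge or flag a nightly run, so the report itself can stay a plain artifact.&lt;/p&gt;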



&lt;p&gt;The pipeline is not meant to replace engineers. It is meant to reduce repetitive debugging and make quality checks repeatable.&lt;/p&gt;

&lt;p&gt;Engineers can then focus on higher-level decisions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Should this repair candidate be accepted?
Is this failure caused by data parsing, retrieval, answer extraction, or judging?
Is the quality gate strong enough for production?
Which type of failure is becoming frequent?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is closer to RAG QAOps than traditional prompt tuning.&lt;/p&gt;

&lt;h2&gt;My Takeaway&lt;/h2&gt;

&lt;p&gt;I used to think of RAG as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retrieval + generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I think enterprise RAG needs to be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retrieval
+ generation
+ evaluation
+ experience
+ repair
+ regression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hard part is not making the model answer.&lt;/p&gt;

&lt;p&gt;The hard part is making every answer accountable.&lt;/p&gt;

&lt;p&gt;In enterprise document scenarios, many failures are not obvious hallucinations. They are subtle “almost correct” answers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The number exists, but the financial scope is wrong.
The answer is correct, but the citation is wrong.
The evidence was retrieved, but ranked too low.
The table was parsed, but the metric and value were misaligned.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These problems are hard to manage with manual spot checks alone.&lt;/p&gt;

&lt;p&gt;That is why I believe enterprise RAG needs an Experience + Repair Pipeline.&lt;/p&gt;

&lt;p&gt;If the first stage of RAG is “can answer”, and the second stage is “can be evaluated”, then the third stage should be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can continuously repair itself, and know when not to auto-repair.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>devops</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
