# Enterprise RAG’s Biggest Risk: Answers That Look Correct but Aren’t

Anthony Jiang — Wed, 03 Jun 2026 15:00:54 +0000

Most RAG demos feel impressive at first.

You upload documents, ask a question, and the system returns a fluent answer with citations. For example:

What was Tesla’s automotive revenue in 2023?

The system retrieves a passage from the annual report, generates an answer, and attaches a source.

At that point, it is tempting to think the system is close to usable.

But after building my own renewable energy industry RAG agent, I realized that answering is only the first layer. The harder problem is proving that the answer is actually reliable.

In enterprise documents, many failures are not obvious hallucinations. They are “almost correct” answers:

- the number exists, but comes from the wrong financial scope
- the answer is correct, but the citation does not really support it
- the PDF table is parsed, but the metric and value are misaligned
- the right evidence is retrieved, but ranked too low
- fixing one bad case causes other cases to regress

That is why I added an Experience + Repair Pipeline on top of my original RAG system.

The goal was not to make the model sound better. The goal was to make failures traceable, repairs testable, and quality improvement repeatable.

## The Original Project

The first version of my project was a renewable energy industry research agent.

It retrieved evidence from annual reports, investor materials, and public documents, then answered research questions about electric vehicles, batteries, energy storage, and solar companies.

Example questions included:

- What was Tesla’s automotive revenue in 2023?
- What financial performance did First Solar disclose in its annual report?
- What capacity expansion plans did LONGi Green Energy mention?
- What was CATL’s global energy storage battery market share?

The system already supported document ingestion, chunking, embeddings, hybrid retrieval, reranking, evidence citation, structured financial metric extraction, frontend workflows, and Docker deployment.

But once the first version was working, a different kind of problem started to appear.

A RAG system can answer a question and still be wrong.

## A Real Failure: The Number Was Real, but the Scope Was Wrong

One bad case looked like this:

What was CATL’s net cash flow from operating activities in 2023?

The system answered:

9,282,612.44 (in units of RMB 10,000, i.e. about RMB 92.8 billion)

This was not a hallucinated number. It really appeared in CATL’s 2023 annual report, and it was related to net cash flow from operating activities.

At first glance, it looked correct.

But the gold evidence in my evaluation set pointed to a different table: the parent company cash flow statement.

The correct answer for that evidence was:

5,647,457.04 (in units of RMB 10,000, i.e. about RMB 56.5 billion)

So the system did not invent a number. It found a real number, but from the wrong financial scope.

That kind of failure is dangerous because it looks professional. The answer is not obviously wrong. You only notice the issue when you check the exact table, scope, row, and citation.

## Another Failure: The Answer Was Correct, but the Citation Was Wrong

Another case was:

How many PhD and master’s degree researchers did CATL have?

The answer was correct:

361 PhD researchers and 3,913 master’s degree researchers.

But the citation pointed to similar chunks instead of the actual gold evidence.

Those chunks also contained words like “PhD”, “master’s”, and “R&D staff”. Some of them even had related numbers. But they were not the most direct supporting evidence for this answer.

In enterprise RAG, citation quality matters.

Users do not only need an answer. They need to know where the answer came from.

That is why I started treating citation accuracy as a separate metric, not just a nice-to-have.
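One way to keep the two metrics apart is to score every case on both dimensions independently. The snippet below is a minimal sketch, not the project's actual evaluator; `CaseResult`, its fields, and the case id are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one eval case (hypothetical structure for illustration)."""
    case_id: str
    answer_correct: bool
    citation_correct: bool


def score(results: list[CaseResult]) -> dict[str, float]:
    """Report answer accuracy and citation accuracy as two separate numbers."""
    total = len(results)
    if total == 0:
        return {"answer_accuracy": 0.0, "citation_accuracy": 0.0}
    return {
        "answer_accuracy": sum(r.answer_correct for r in results) / total,
        "citation_accuracy": sum(r.citation_correct for r in results) / total,
    }


# The R&D staff case: the numbers were right, the supporting chunk was not.
results = [CaseResult("catl_rd_staff", answer_correct=True, citation_correct=False)]
metrics = score(results)
```

With a single combined "correct" flag, this case would either hide the citation problem or hide the fact that the answer itself was right.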

## Making Failed Cases Explain Themselves

When a RAG system fails, the usual workflow is very manual.

You open the bad case, inspect the retrieved chunks, guess what went wrong, change something, and run a few questions again.

That can work for a few examples. It does not scale.

So I added an Experience layer.

In this project, an experience is not just a log.

A log tells me:

- What was the input?
- What was the output?
- Did anything crash?

An experience should tell me:

- Why did this case fail?
- Did the failure happen in retrieval, reranking, answer extraction, citation, or judging?
- Was the problem caused by PDF table parsing?
- Was it caused by financial statement scope?
- Was the answer correct but the citation wrong?
- Can this be repaired automatically?
- Which cases should be rerun to verify the repair?

Each failed case is converted into structured information:

- `root_cause`
- `root_cause_detail`
- `diagnostics`
- `trace`
- `repair_iteration_hint`

Some failure types I saw repeatedly:

- `table_line_break`
- `duplicate_equivalent_chunk`
- `market_share`
- `gold_rank_4_or_5`
- `scope_conflict`
- `neighbor_value_selected`

The diagnostics also store details such as:

- `gold_rank`
- `answer_numbers`
- `question_terms`
- `evidence_candidates`
- `citation_repair`
- `structured_fact`
- `missing_schema_fields`

This turns a failed case from an isolated bug into something the system can reason about.
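As a rough illustration of what such a record could look like, here is a sketch built from the field names listed above. The types, the `case_id` field, the shape of `trace`, and the example values for `root_cause_detail` and `repair_iteration_hint` are my assumptions for this example, not the project's actual schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Experience:
    """One failed eval case in structured form (assumed types)."""
    case_id: str  # hypothetical identifier, not one of the listed fields
    root_cause: str
    root_cause_detail: str
    diagnostics: dict[str, Any] = field(default_factory=dict)
    trace: list[str] = field(default_factory=list)  # assumed: pipeline stages visited
    repair_iteration_hint: str = ""


# The CATL operating cash flow case, expressed as an experience.
exp = Experience(
    case_id="catl_operating_cash_flow_2023",
    root_cause="scope_conflict",
    root_cause_detail="Answer value comes from a different statement scope "
                      "than the gold evidence (parent company cash flow statement).",
    diagnostics={
        "answer_numbers": ["9,282,612.44"],
        "question_terms": ["net cash flow from operating activities", "2023"],
    },
    trace=["retrieval", "rerank", "answer_extraction", "citation", "judge"],
    repair_iteration_hint="table_row_stitch",
)

record = asdict(exp)  # plain dict, ready to serialize or aggregate
```

Because the record is plain data, failures can be grouped by `root_cause` instead of being re-read one by one.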

## Repair Pipeline: Stop Fixing RAG by Guesswork

Once failed cases had structured explanations, I built the next step: repair candidates.

A repair candidate is not an immediate code change. It is a proposed fix that must be tested.

The loop looks like this:

eval set → failed cases → experience → root cause and diagnostics → repair plan → repair candidates → targeted regression → acceptance gate

The repair candidates I used included:

- `market_share_binding`
- `table_row_stitch`
- `equivalent_citation_group`
- `rerank_exact_phrase_hint`
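The step from root cause to repair candidate can be sketched as a simple lookup. The pairing below is my assumption based on how the names line up; the real repair plan may map them differently. Root causes with no entry are returned separately so they can go to manual review instead of being auto-repaired.

```python
# Assumed pairing of failure types to repair candidates (illustrative only).
REPAIR_CANDIDATES_BY_ROOT_CAUSE: dict[str, list[str]] = {
    "table_line_break": ["table_row_stitch"],
    "duplicate_equivalent_chunk": ["equivalent_citation_group"],
    "market_share": ["market_share_binding"],
    "gold_rank_4_or_5": ["rerank_exact_phrase_hint"],
}


def propose_candidates(root_causes: list[str]) -> tuple[list[str], list[str]]:
    """Return (candidates to test, root causes with no automatic candidate)."""
    candidates: list[str] = []
    unmapped: list[str] = []
    for cause in root_causes:
        names = REPAIR_CANDIDATES_BY_ROOT_CAUSE.get(cause)
        if not names:
            if cause not in unmapped:
                unmapped.append(cause)
            continue
        for name in names:
            if name not in candidates:  # keep order, drop duplicates
                candidates.append(name)
    return candidates, unmapped


to_test, needs_review = propose_candidates(
    ["market_share", "scope_conflict", "market_share"]
)
```

Nothing returned here is applied directly; every proposed candidate still has to go through targeted regression.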

### Market Share Binding

Market share questions often contain many percentages in one paragraph.

A report may mention:

- EV battery market share
- energy storage battery market share
- global market share
- domestic market share
- year-over-year growth

The system cannot simply pick a nearby percentage.

For this case, the repair candidate binds:

subject + market + metric + value + citation

So the value must match the requested subject and market context, not just any percentage in the same chunk.
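A minimal sketch of the binding idea, assuming the percentages have already been extracted into structured facts. `MarketShareFact`, `bind_market_share`, the company name, and all values are placeholders made up for this example, not figures from any report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MarketShareFact:
    subject: str   # company the percentage is about
    market: str    # e.g. "energy storage battery"
    metric: str    # e.g. "market_share" or "yoy_growth"
    value: str
    citation: str  # id of the chunk the value came from


def bind_market_share(
    facts: list[MarketShareFact], subject: str, market: str
) -> Optional[MarketShareFact]:
    """Return the one fact matching subject and market, or None if the
    binding is missing or ambiguous (better to abstain than to guess)."""
    matches = [
        f for f in facts
        if f.metric == "market_share"
        and f.subject.lower() == subject.lower()
        and f.market.lower() == market.lower()
    ]
    return matches[0] if len(matches) == 1 else None


# Placeholder facts extracted from one paragraph of one chunk.
facts = [
    MarketShareFact("ExampleCo", "EV battery", "market_share", "11.1%", "chunk-7"),
    MarketShareFact("ExampleCo", "energy storage battery", "market_share", "22.2%", "chunk-7"),
    MarketShareFact("ExampleCo", "energy storage battery", "yoy_growth", "3.3%", "chunk-7"),
]
hit = bind_market_share(facts, "ExampleCo", "energy storage battery")
miss = bind_market_share(facts, "ExampleCo", "solar module")
```

All three percentages live in the same chunk, so chunk-level retrieval alone cannot tell them apart; the binding has to happen at the fact level.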

### Table Row Stitching

PDF tables often break rows apart.

A human can understand this:

Net cash flow from operating activities
5,647,457.04

But a parser may treat the metric and the value as separate lines.

The table_row_stitch candidate tries to rebuild the row by binding:

- table title
- metric line
- value line
- period
- unit
- citation
- statement scope

This helped fix the CATL operating cash flow case.
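The stitching step can be sketched roughly as follows, assuming the parser emits the metric label and its value as consecutive lines. The regular expression, the `StitchedRow` type, and the chunk id are simplifications I chose for the example; real annual-report tables have multi-column rows and wrapped labels that this does not handle.

```python
import re
from dataclasses import dataclass

# A line that is only a number, e.g. "5,647,457.04" or "(1,234.50)".
NUMBER_LINE = re.compile(r"^-?\(?\d[\d,]*(?:\.\d+)?\)?$")


@dataclass
class StitchedRow:
    table_title: str
    statement_scope: str
    metric: str
    value: str
    period: str
    unit: str
    citation: str


def stitch_rows(lines: list[str], *, table_title: str, statement_scope: str,
                period: str, unit: str, citation: str) -> list[StitchedRow]:
    """Re-attach each value line to the metric line directly above it."""
    rows: list[StitchedRow] = []
    pending_metric = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if NUMBER_LINE.match(line):
            if pending_metric is not None:
                rows.append(StitchedRow(table_title, statement_scope,
                                        pending_metric, line, period, unit, citation))
                pending_metric = None
            # A number with no label above it is dropped rather than guessed.
        else:
            pending_metric = line
    return rows


rows = stitch_rows(
    ["Net cash flow from operating activities", "5,647,457.04"],
    table_title="Parent company cash flow statement",
    statement_scope="parent_company",
    period="2023",
    unit="RMB 10,000",
    citation="chunk-id-placeholder",  # hypothetical chunk id
)
```

Carrying `statement_scope` on every row is what lets a later step reject a value from the wrong statement even when the metric name matches.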

### Equivalent Citation Groups

Sometimes the answer is correct, but the cited chunk is not the exact gold chunk.

This can happen because of chunk overlap, duplicated tables, or repeated evidence across nearby pages.

So I added equivalent_citation_group.

If a cited chunk and the gold chunk share the key question terms and the key answer values, they can be treated as equivalent supporting evidence.

For the CATL R&D staff case, chunks containing both:

PhD: 361
Master’s: 3,913

could be grouped as equivalent evidence for the same answer.
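The equivalence rule can be sketched as a plain containment check. This is a simplified illustration: the function name is hypothetical, the chunk texts are made up, and substring matching is cruder than what a production check would need (for example, "361" would also match inside "1,361").

```python
def is_equivalent_citation(
    cited_text: str,
    gold_text: str,
    question_terms: list[str],
    answer_values: list[str],
) -> bool:
    """True if both chunks contain every key question term and answer value."""

    def covers(text: str) -> bool:
        lowered = text.lower()
        has_terms = all(term.lower() in lowered for term in question_terms)
        has_values = all(value in text for value in answer_values)
        return has_terms and has_values

    return covers(cited_text) and covers(gold_text)


terms = ["PhD", "Master"]
values = ["361", "3,913"]

gold_chunk = "R&D staff by education: PhD: 361, Master's: 3,913"
overlap_chunk = "Education of R&D personnel. PhD: 361; Master's: 3,913; Bachelor's and below ..."
similar_chunk = "The company continues to recruit PhD and Master's graduates for R&D."

same_evidence = is_equivalent_citation(overlap_chunk, gold_chunk, terms, values)
different_evidence = is_equivalent_citation(similar_chunk, gold_chunk, terms, values)
```

The third chunk shares the vocabulary but not the values, so it stays a wrong citation rather than being excused as equivalent.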

### Rerank Phrase Hints

Sometimes the right evidence is retrieved, but ranked fourth or fifth.

In that case, the model may cite a similar chunk ranked above it.

The `rerank_exact_phrase_hint` candidate uses question terms, answer values, and gold-like phrases to promote the more relevant evidence.
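One simple way to express such a hint is an additive bonus on top of the reranker score. The sketch below assumes that approach; the `boost` weight, the chunk dictionaries, and the scores are illustrative values, not taken from the project.

```python
def rerank_with_phrase_hints(
    chunks: list[dict], phrases: list[str], boost: float = 0.05
) -> list[dict]:
    """Re-sort chunks by reranker score plus a bonus per exact phrase hit."""

    def boosted_score(chunk: dict) -> float:
        text = chunk["text"].lower()
        hits = sum(1 for phrase in phrases if phrase.lower() in text)
        return chunk["score"] + boost * hits

    # sorted() is stable, so chunks with equal boosted scores keep their order.
    return sorted(chunks, key=boosted_score, reverse=True)


# Made-up reranker output: the better chunk starts below a similar one.
chunks = [
    {"id": "c1", "score": 0.82,
     "text": "Consolidated cash flow statement: net cash flow from operating activities"},
    {"id": "c4", "score": 0.80,
     "text": "Parent company cash flow statement: net cash flow from operating activities"},
]
phrases = ["net cash flow from operating activities", "parent company"]
reranked = rerank_with_phrase_hints(chunks, phrases)
```

A small boost is deliberate here: the hint should break ties between near-identical chunks, not override the reranker on unrelated ones.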

## The Acceptance Gate

Repair candidates are not accepted just because they fix one example.

They must pass targeted regression.

The acceptance gate checks:

- Did the number of failed cases decrease?
- Did answer accuracy stay the same or improve?
- Did citation accuracy stay the same or improve?
- Did hallucination rate stay the same or decrease?

This caught a real regression.

At one point, a new set of repair candidates fixed the final two bad cases, but caused three previously fixed cases to fail again.

If I had only looked at the last two cases, I would have thought the repair worked.

The acceptance gate rejected it.
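The four checks translate almost directly into code. This is a sketch under my own naming (`RegressionReport`, `accept`); the example numbers are the before-repair and first-repair figures reported later in this post. The hallucination rate is optional because no value is reported for the before-repair run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegressionReport:
    failed_cases: int
    answer_accuracy: float
    citation_accuracy: float
    hallucination_rate: Optional[float] = None  # None means not measured


def accept(before: RegressionReport, after: RegressionReport) -> bool:
    """Accept a repair only if failures drop and no metric gets worse."""
    if after.failed_cases >= before.failed_cases:
        return False  # must strictly reduce the number of failed cases
    if after.answer_accuracy < before.answer_accuracy:
        return False
    if after.citation_accuracy < before.citation_accuracy:
        return False
    if (before.hallucination_rate is not None
            and after.hallucination_rate is not None
            and after.hallucination_rate > before.hallucination_rate):
        return False
    return True


before_repair = RegressionReport(failed_cases=5, answer_accuracy=0.60,
                                 citation_accuracy=0.0)
first_repair = RegressionReport(failed_cases=2, answer_accuracy=0.80,
                                citation_accuracy=0.60, hallucination_rate=0.0)

accepted = accept(before_repair, first_repair)
rolled_back = accept(first_repair, before_repair)
```

The rejected run described above fails the very first check: going from two failed cases to three is not a decrease, no matter which cases were fixed.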

After that, I added candidate merging. Historical accepted candidates are merged with new candidates before running regression again.

With candidate merging, all five bad cases passed.
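Candidate merging itself can be as simple as an ordered union. The sketch below assumes candidates are identified by name; which candidates sit in the accepted history versus the new batch is illustrative here, not a record of the actual runs.

```python
def merge_candidates(accepted_history: list[str], new_candidates: list[str]) -> list[str]:
    """Ordered union: previously accepted candidates first, then new ones."""
    merged: list[str] = []
    for name in [*accepted_history, *new_candidates]:
        if name not in merged:
            merged.append(name)
    return merged


# Illustrative split between earlier accepted repairs and a new batch.
history = ["market_share_binding", "table_row_stitch"]
new_batch = ["table_row_stitch", "equivalent_citation_group", "rerank_exact_phrase_hint"]
merged = merge_candidates(history, new_batch)
```

Regression then runs against the merged set, so a new batch can no longer silently drop a repair that earlier cases depend on.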

## Results from One Iteration

In this round, I evaluated 57 QA cases.

After several iterations, 5 representative bad cases remained. They covered market share extraction, PDF table line breaks, financial statement scope, citation binding, and reranking.

Before repair:

- 5 targeted bad cases
- 5 failed
- answer accuracy: 60%
- citation accuracy: 0%

After the first repair candidate application:

- 5 targeted bad cases
- 2 failed
- answer accuracy: 80%
- citation accuracy: 60%
- hallucination rate: 0%

After fixing the remaining two and merging historical candidates:

- 5 targeted bad cases
- 0 failed
- answer accuracy: 100%
- citation accuracy: 100%
- hallucination rate: 0%

This does not mean the entire RAG system is now 100% correct.

It means that, for this set of real bad cases, the pipeline was able to:

- detect failures
- explain failures
- generate repair candidates
- run targeted regression
- reject regressive repairs
- merge accepted candidates
- continue improving

That is much more reliable than asking a few questions manually and deciding that the system “feels better”.

## Why Scheduled Jobs and CI/CD Make This More Useful

Right now, I run this pipeline manually.

The more useful direction is to connect it with scheduled jobs and CI/CD.

Enterprise RAG systems are not static. Quality can change whenever:

- new documents are added
- chunks are rebuilt
- embedding models change
- reranking strategies change
- prompts are updated
- PDF parsers are changed
- DataJuicer cleaning rules change

Each change may silently break something.

A new chunking strategy may improve recall but hurt citation accuracy.

A prompt update may make answers more fluent but less grounded.

A PDF parser fix may solve one table but misalign another.

A reranker change may promote the right evidence for one query but demote it for another.

If the Experience + Repair Pipeline is wired into scheduled jobs or CI/CD, it can automatically run:

- evaluation
- failure experience generation
- repair candidate generation
- targeted regression
- quality gate
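For the last step, a CI job only needs a process exit code. The sketch below shows one way a quality gate could turn a metric comparison into that code. The function name, the metric keys, and the failing run's values are hypothetical; the baseline reuses the figures from the first repair round above.

```python
from typing import Callable

Report = dict[str, float]

HIGHER_IS_BETTER = ("answer_accuracy", "citation_accuracy")
LOWER_IS_BETTER = ("hallucination_rate",)


def quality_gate(baseline: Report, run_eval: Callable[[], Report]) -> int:
    """Run the eval and return 0 if nothing regressed, 1 otherwise."""
    current = run_eval()
    regressions = [m for m in HIGHER_IS_BETTER if current[m] < baseline[m]]
    regressions += [m for m in LOWER_IS_BETTER if current[m] > baseline[m]]
    for metric in regressions:
        print(f"REGRESSION {metric}: {baseline[metric]:.2f} -> {current[metric]:.2f}")
    return 1 if regressions else 0


baseline = {"answer_accuracy": 0.80, "citation_accuracy": 0.60, "hallucination_rate": 0.0}


def fake_eval_after_chunking_change() -> Report:
    # Hypothetical result: a chunking change that hurt citation accuracy.
    return {"answer_accuracy": 0.80, "citation_accuracy": 0.40, "hallucination_rate": 0.0}


exit_code = quality_gate(baseline, fake_eval_after_chunking_change)
# In a CI entry point this would end with: sys.exit(exit_code)
```

A nonzero exit code fails the job, which is what stops a silently regressing change from being merged or deployed.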

For a RAG engineer, this means less repetitive spot checking.

Instead of repeatedly asking a few questions after every change, the system can report:

- which cases failed
- why they failed
- what repair candidates were generated
- whether the repair improved quality
- whether it caused regressions

The pipeline is not meant to replace engineers. It is meant to reduce repetitive debugging and make quality checks repeatable.

Engineers can then focus on higher-level decisions:

- Should this repair candidate be accepted?
- Is this failure caused by data parsing, retrieval, answer extraction, or judging?
- Is the quality gate strong enough for production?
- Which failure types are becoming more frequent?

This is closer to RAG QAOps than traditional prompt tuning.

## My Takeaway

I used to think of RAG as:

retrieval + generation

Now I think enterprise RAG needs to be:

retrieval + generation + evaluation + experience + repair + regression

The hard part is not making the model answer.

The hard part is making every answer accountable.

In enterprise document scenarios, many failures are not obvious hallucinations. They are subtle “almost correct” answers:

- The number exists, but the financial scope is wrong.
- The answer is correct, but the citation is wrong.
- The evidence was retrieved, but ranked too low.
- The table was parsed, but the metric and value were misaligned.

These problems are hard to manage with manual spot checks alone.

That is why I believe enterprise RAG needs an Experience + Repair Pipeline.

If the first stage of RAG is “can answer”, and the second stage is “can be evaluated”, then the third stage should be:

can continuously repair itself, and know when not to auto-repair.

DEV Community: Anthony Jiang