João Paulo Traguetta Rufino

Posted on May 30 • Edited on Jun 4

5 Failure Modes I Found in My Financial RAG (And the One That Actually Mattered)

#ai #programming #python #rag

My RAG system for financial document Q&A was stuck at 53% accuracy. I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Accuracy went to 58%.

Then I ran a corpus audit and found that 5 documents were never ingested and 2 were corrupted. Fixing that alone pushed Recall@6 from 84% to 94%.

The most impactful improvement in the entire project took 30 minutes and zero lines of new code.

The setup

Quick context: I'm building a RAG system evaluated against FinanceBench (Patronus AI), a benchmark with 150 expert-annotated Q&A pairs about SEC filings. The pipeline is GPT-4o-mini for generation, text-embedding-3-small for embeddings, and Qdrant as the vector store. Full eval infrastructure with LLM-as-judge calibrated against human labels (Post 1 covers the eval setup: https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030).

After Phase 2, I had a baseline: Recall@6 of 0.83, and about 47 out of 100 queries answered correctly (verified by human labels).
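For readers less familiar with the retrieval metrics used throughout, here is a minimal sketch of how Recall@6 and MRR could be computed. It assumes Recall@k is counted as a hit rate (a query is a hit when at least one relevant chunk appears in the top k), which is an assumption and not something taken from the eval harness. The function names and the `runs` structure are hypothetical.

```python
from typing import Sequence, Set, Tuple

# One entry per query: (ranked chunk IDs from the retriever, set of relevant chunk IDs).
Run = Tuple[Sequence[str], Set[str]]


def recall_at_k(runs: Sequence[Run], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k.

    Assumption: hit-rate style recall, not per-chunk recall.
    """
    hits = sum(
        any(chunk_id in relevant for chunk_id in retrieved[:k])
        for retrieved, relevant in runs
    )
    return hits / len(runs)


def mean_reciprocal_rank(runs: Sequence[Run]) -> float:
    """Average of 1/rank of the first relevant chunk (0 when none is retrieved)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Toy data, not taken from FinanceBench.
toy_runs = [
    (["a", "b", "c"], {"b"}),  # first relevant chunk at rank 2
    (["x", "y", "z"], {"q"}),  # relevant chunk never retrieved
]
print(recall_at_k(toy_runs, k=2))      # 0.5
print(mean_reciprocal_rank(toy_runs))  # 0.25
```

Under this definition, a Recall@6 of 0.83 on 100 queries corresponds to 17 queries where no relevant chunk made it into the top 6.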

The 5 failure modes

I categorized every error in my 100-query eval set. Here's what I found:

1. Missing documents (the big one)

I was debugging why Johnson & Johnson queries always failed. 9 out of 17 retrieval misses were J&J documents. I assumed it was a semantic similarity problem since all SEC filings use nearly identical language.

It wasn't. The documents were never downloaded.

An audit revealed that 5 out of 84 documents in the dataset were missing from my vector store, and 2 more were corrupted during PDF extraction (AMD and KraftHeinz had 5 and 0 chunks respectively instead of 150+). After fixing this, retrieval misses dropped from 16 to 6.

Metric	Before corpus fix	After corpus fix	Delta
Recall@6	0.840	0.940	+0.100
Retrieval misses	16	6	-10

This is the lesson I keep coming back to: I spent days implementing algorithmic improvements when the root cause was incomplete data. In production, this happens constantly: pipelines fail silently, documents get skipped, and files get corrupted during processing. No amount of reranking or query rewriting fixes missing data.
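An audit of this kind can be sketched in a few lines. The version below is a hypothetical illustration, not the script used in this project: it assumes you can already produce a mapping from document name to the number of chunks found in the vector store, and the `min_chunks` threshold of 50 is an arbitrary placeholder (healthy documents here had 150+ chunks).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class AuditReport:
    missing: List[str] = field(default_factory=list)            # no entry in the store at all
    suspicious: Dict[str, int] = field(default_factory=dict)    # present, but too few chunks

    @property
    def ok(self) -> bool:
        return not self.missing and not self.suspicious


def audit_corpus(
    expected_docs: Iterable[str],
    chunk_counts: Dict[str, int],
    min_chunks: int = 50,  # assumption: tune per corpus
) -> AuditReport:
    """Compare the dataset's document list against what was actually ingested."""
    report = AuditReport()
    for doc in expected_docs:
        count = chunk_counts.get(doc)
        if count is None:
            report.missing.append(doc)
        elif count < min_chunks:
            report.suspicious[doc] = count
    return report


# Illustrative names and counts only.
report = audit_corpus(
    expected_docs=["DOC_A", "DOC_B", "DOC_C"],
    chunk_counts={"DOC_A": 180, "DOC_B": 5},
)
print(report.missing)     # ['DOC_C']
print(report.suspicious)  # {'DOC_B': 5}
print(report.ok)          # False
```

Failing the ingestion job whenever `report.ok` is false would turn a silent gap into a visible error.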

2. Cross-document confusion within the same company

After ingesting the missing J&J documents, a new problem appeared. The retriever sometimes pulled chunks from J&J's 2022 10-K when the query was about J&J's 2023 8-K. Same company, wrong document.

My metadata filter extracted company name from the query and filtered Qdrant before retrieval. But filtering by "Johnson & Johnson" doesn't help when both the right and wrong documents are from Johnson & Johnson.

Fix: extended the filter to extract company + year + document type. This helped but didn't fully resolve it since the language overlap between a company's own filings across years is even higher than between different companies.
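As a rough sketch of what extracting company + year + document type from a query could look like, the snippet below uses an alias table and two regular expressions. Everything in it is an assumption for illustration: the alias table, the normalized labels, and the field names `company`, `year`, and `doc_type` are hypothetical, and the real filter may well use an LLM or a different scheme.

```python
import re
from typing import Any, Dict

# Hypothetical alias table; a real one would cover every company in the corpus.
COMPANY_ALIASES = {
    "johnson & johnson": "JOHNSON_JOHNSON",
    "j&j": "JOHNSON_JOHNSON",
    "amd": "AMD",
}
DOC_TYPE_PATTERN = re.compile(r"\b(10-?K|10-?Q|8-?K)\b", re.IGNORECASE)
YEAR_PATTERN = re.compile(r"(?<!\d)(20\d{2})(?!\d)")  # also matches the year in "FY2023"


def extract_filters(query: str) -> Dict[str, Any]:
    """Return only the metadata fields that could be detected in the query."""
    filters: Dict[str, Any] = {}
    lowered = query.lower()

    for alias, company in COMPANY_ALIASES.items():
        # Lookarounds prevent matching an alias inside a longer word.
        if re.search(r"(?<!\w)" + re.escape(alias) + r"(?!\w)", lowered):
            filters["company"] = company
            break

    year = YEAR_PATTERN.search(query)
    if year:
        filters["year"] = int(year.group(1))

    doc_type = DOC_TYPE_PATTERN.search(query)
    if doc_type:
        filters["doc_type"] = doc_type.group(1).upper().replace("-", "")

    return filters


print(extract_filters("What was J&J's revenue in the 2023 8-K?"))
# {'company': 'JOHNSON_JOHNSON', 'year': 2023, 'doc_type': '8K'}
```

Returning only the detected fields lets the caller fall back to a looser filter when the query does not name a year or filing type, instead of filtering on a guess.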

3. Numerical extraction errors

The model retrieves the right document, finds the right section, but produces the wrong number. Quick ratio of 1.76 when the correct answer is 1.57. Dividend payout ratio of 83.56% when it should be 80%.

These errors are invisible to a standard LLM judge because the answer is well-structured, uses correct methodology, and sounds right. More on this below.
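One way to make this class of error visible without an LLM is to parse the figure out of the answer and the reference and compare them numerically. The sketch below is an illustration under stated assumptions, not part of this pipeline: the scale table, the regex, and the 1% relative tolerance are all hypothetical choices, and the parser only looks at the first number in each string.

```python
import math
import re
from typing import Optional

# Hypothetical scale suffixes; extend as needed for the corpus.
SCALES = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "mm": 1e6, "million": 1e6,
    "b": 1e9, "bn": 1e9, "billion": 1e9,
}
NUMBER_PATTERN = re.compile(
    r"(?P<value>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?P<scale>thousand|million|billion|bn|mm|k|m|b)?(?![a-z])",
    re.IGNORECASE,
)


def parse_amount(text: str) -> Optional[float]:
    """Parse the first number in `text` into a plain float, applying any scale word."""
    match = NUMBER_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group("value").replace(",", ""))
    scale = match.group("scale")
    return value * SCALES[scale.lower()] if scale else value


def numbers_agree(answer: str, reference: str, rel_tol: float = 0.01) -> bool:
    """True when both strings contain a number and they match within `rel_tol`."""
    a, r = parse_amount(answer), parse_amount(reference)
    if a is None or r is None:
        return False
    return math.isclose(a, r, rel_tol=rel_tol)


print(numbers_agree("$1,608M", "1,608 million"))  # True
print(numbers_agree("$1,608M", "$2,018M"))        # False
print(numbers_agree("1.76", "1.57"))              # False
```

The tolerance is a trade-off: at 1%, "$1.6B" is accepted against "$1,608M" as a rounding difference, which may or may not be acceptable depending on how the number is used downstream.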

4. Hybrid retrieval noise

I added BM25 alongside dense retrieval, fused via Reciprocal Rank Fusion. The theory: keyword matching catches exact terms that semantic search misses.

The result: Precision dropped from 0.422 to 0.405 and MRR dropped from 0.646 to 0.594. BM25 was pulling in keyword-matching chunks that weren't semantically relevant, pushing the correct chunks lower in the ranking. Not every theoretical improvement works in practice.
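For reference, Reciprocal Rank Fusion scores each chunk as the sum of 1/(k + rank) over the rankings it appears in. The sketch below uses the commonly cited default of k = 60, which is an assumption here since the post does not state the value used. The toy rankings are made up, but they show the mechanism behind the noise: a chunk that only BM25 likes can still outrank a chunk that dense retrieval placed second.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Toy rankings: "d" is a keyword-only match that dense retrieval never returned.
dense = ["a", "b", "c"]
bm25 = ["d", "a", "c"]
print(reciprocal_rank_fusion([dense, bm25]))  # ['a', 'c', 'd', 'b']
```

In this example "b", ranked second by dense retrieval, ends up last after fusion, because "d" gets full credit for being BM25's top hit regardless of whether it is semantically relevant.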

5. Judge inflation

This one is subtle and easy to miss.

My LLM judge (A|GT, with ground truth) said 63 out of 100 answers were correct. When I checked against 30 human labels, the real number was about 47. The judge was inflating the score by 16 points, or about 34% in relative terms (63 vs. 47).

Why? The judge evaluated reasoning quality and methodology, not numerical accuracy. An answer that says "$1,608M" when the correct answer is "$2,018M" got a 5/5 because the explanation was well-structured.

I built a stricter judge (v2) with explicit numerical comparison rules. It brought the estimate down to 51/100, much closer to reality. But it overcorrected: TPR dropped from 0.93 to 0.71, meaning it now rejects 29% of correct answers.

Judge	Reported correct	TPR	TNR	Inflation vs human (points)
v1 (lenient)	63/100	0.93	0.75	+16
v2 (strict)	51/100	0.71	0.94	+4
Human	~47/100	—	—	—
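The TPR and TNR columns come from comparing judge verdicts against human labels on the same answers. A minimal sketch of that calculation follows. The function name is hypothetical and the labels are toy data, not the 30 human labels from this project.

```python
from typing import Dict, Sequence


def judge_rates(judge: Sequence[bool], human: Sequence[bool]) -> Dict[str, float]:
    """TPR and TNR of a judge, treating the human labels as ground truth.

    TPR: share of human-correct answers the judge also accepts.
    TNR: share of human-incorrect answers the judge also rejects.
    """
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must have the same length")

    positives = sum(human)
    negatives = len(human) - positives
    true_pos = sum(j and h for j, h in zip(judge, human))
    true_neg = sum((not j) and (not h) for j, h in zip(judge, human))

    return {
        "tpr": true_pos / positives if positives else float("nan"),
        "tnr": true_neg / negatives if negatives else float("nan"),
    }


# Toy labels: True means "answer judged correct".
human = [True, True, True, True, False, False, False, False]
judge = [True, True, True, False, True, False, False, False]
print(judge_rates(judge, human))  # {'tpr': 0.75, 'tnr': 0.75}
```

A lenient judge shows up as a low TNR (wrong answers approved), a strict one as a low TPR (correct answers rejected), which is the trade-off between v1 and v2 in the table.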

There is no perfect judge. You choose your bias: false positives (approve wrong answers) or false negatives (reject correct answers). The only ground truth is human labels.

What actually moved the needle

Here's the full progression:

Phase	Recall@6	Accuracy (human)	What changed
Phase 2 (baseline)	0.830	~47/100	Nothing, first measurement
Phase 3 (algorithmic fixes)	0.840	~47/100	Hybrid retrieval + metadata filtering
Phase 3b (corpus fix)	0.940	~47/100	Ingested 5 missing + 2 corrupted docs

The uncomfortable finding: retrieval improved significantly (83% to 94%) but real accuracy stayed at ~47%. The bottleneck shifted from retrieval to generation. The model now gets the right document but still produces wrong numbers.

What I didn't implement (and why)

Approach	Why I skipped it
Semantic chunking	Requires full re-ingestion, uncertain impact vs current chunking
Reranker (Cohere/cross-encoder)	Adds latency and cost, lower priority than data quality fixes
Query rewriting / HyDE	Adds LLM call per query, cross-company noise likely persists
Contextual retrieval	High potential but requires full re-ingestion pipeline

These aren't bad ideas. They're deferred decisions. The point is knowing when to stop iterating on retrieval and start looking at generation quality, which is where the bottleneck is now.

What I learned

Data quality beats algorithms. My most sophisticated fix (hybrid retrieval with RRF) had a negative impact on two metrics. My simplest fix (downloading missing files) had the largest positive impact of the entire project.

Your eval is only as good as your judge. Three different evaluation approaches gave me three different accuracy numbers: 63%, 51%, and 47%. Without human calibration, I would have reported 63% and believed the pipeline was improving when it wasn't.

Know when the bottleneck shifts. I could keep optimizing retrieval, but with Recall at 94%, the remaining errors are in generation. The next improvements need to target how the model extracts and reasons about numbers, not how it finds documents.

Repo: financebench-rag-eval

References

FinanceBench — Patronus AI
6 RAG Evals — Jason Liu
LLM Evals FAQ — Hamel Husain
Arize Phoenix
AI Builder's Handbook — LevelUp Labs

Top comments (8)

Tae Kim • Jun 5

For #3 the thing that actually worked for me was forcing the generator to copy verbatim numeric spans from the retrieved chunk — if the exact figure isn't a literal substring of the source, the claim gets dropped. Killed the digit-transposition and year-mix class cleanly. Trade-off: it refuses when the source says "$1.6 billion" and the question wants "$1,608M", so I normalized numeric tokens on the source side (commas, scale words, currency symbols) before the substring check.

João Paulo Traguetta Rufino • Jun 5

That's a smart approach. I didn't try forcing verbatim extraction, but it makes sense for this domain: most of the numerical errors I found were exactly that kind of digit transposition or scale confusion ($1.6B vs $1,608M). The normalization step on the source side is the key insight; otherwise you'd reject too many valid matches. I might experiment with this as a post-generation validation step in the CRAG pipeline: extract the number from the answer, normalize it, and check it against the source chunk before returning. Thanks for sharing what worked.

Tae Kim • Jun 5

The gotcha for me was bidirectional normalization. Model writes "$1.6B", source has "1,608 million", and a string check fails even though both are correct. I ended up parsing both sides to canonical floats with a small tolerance band, scale-unit mismatches were where most of my false negatives lived.

arun rajkumar • Jun 5

The judge-inflation section is the part more fintech teams need to sit with. In payments the cost of errors is brutally asymmetric — a confidently wrong number that sounds right is far more dangerous than an honest "I can't answer that," because someone downstream acts on it without a second look. So your v2 judge "overcorrecting" to reject 29% of correct answers isn't really a bug; for this domain that's arguably the correct bias — false negatives are cheap (a human double-checks) and false positives move money. And the corpus-audit lesson is an observability gap dressed up as a RAG problem: we treat "did every document land with the expected chunk count" as a build-failing assertion, same as any ETL. If it can fail silently, it will, and the model takes the blame. Have you looked at surfacing per-field confidence — isolating the extracted number from the prose so a human only has to eyeball the figure, not re-read the whole answer?

João Paulo Traguetta Rufino • Jun 5

Great point on the asymmetry. I hadn't framed it that way but you're right: in the financial domain, a judge that rejects correct answers is safer than one that approves wrong ones. The v2 'overcorrection' is actually the correct trade-off when someone downstream might act on the number. On per-field confidence: I haven't implemented it yet but it's a natural next step. Right now the answer is a full prose paragraph and the human has to re-read everything to verify. Extracting the numeric claim separately (e.g. 'revenue: $1,577M [confidence: high, exact match in source]') would make human review much faster. Added to future work. And yes, the corpus audit should absolutely be a build-failing assertion, since silent ingestion failures are the most dangerous kind: the system keeps working, just worse.

arun rajkumar • Jun 6

The per-field confidence idea is the one I'd chase first — "revenue: $1,577M [confidence: high, exact match in source]" turns a prose blob a human has to re-read into something you can actually gate on. That's the whole game in regulated finance: make the model's uncertainty machine-readable so review scales instead of becoming the bottleneck. And yes — build-failing on silent ingestion gaps. The failures that keep the system running while quietly wrong are the expensive ones, because nobody's looking. Great series; subscribed for v2.

Harjot Singh • May 31

This is the most under-told lesson in all of RAG: you spent two weeks on the sophisticated levers (hybrid retrieval, metadata filtering, query routing) for +5%, and 30 minutes auditing the corpus for +11%. The clever stuff optimizes how well you retrieve from what's there; it can't retrieve a document that was never ingested or is silently corrupted. Garbage-or-missing-in, confidently-wrong-out, and no amount of reranking fixes a hole in the index. The reason this gets skipped is that data quality is unglamorous and invisible, nobody demos a corpus audit, and the pipeline gives you no error when 5 docs quietly failed to ingest, it just answers worse and you blame the model. The discipline I'd take from this: before tuning retrieval, verify the ground truth is actually all there and intact, ingestion needs its own check (did every doc land, is it readable) the same way you'd validate any ETL. Fix the data before you optimize the retrieval. That verify-the-inputs-first instinct is core to how I build with RAG in Moonshift. Did you add an ingestion-completeness check after this, or is the corpus audit still a manual periodic thing?

João Paulo Traguetta Rufino • Jun 3

Thanks, that's exactly the lesson. After finding the gap I added a post-ingestion validation script that cross-references the dataset's doc_name list against what's actually in Qdrant. That's how I caught the 5 missing docs. The corrupted ones (AMD and KraftHeinz) I found manually when investigating why certain queries kept failing despite the documents supposedly being ingested. It's not a fully automated pipeline check yet, it runs as a separate audit, but it catches silent failures before they pollute the eval. Next step would be integrating it into the ingestion pipeline with a chunk count threshold per document so corrupted ingestions get flagged automatically.