João Paulo Traguetta Rufino

5 Failure Modes I Found in My Financial RAG (And the One That Actually Mattered)

My RAG system for financial document Q&A was stuck at 53% accuracy. I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Accuracy went to 58%.

Then I ran a corpus audit and found that 5 documents were never ingested and 2 were corrupted. Fixing that alone pushed recall from 84% to 94%.

The most impactful improvement in the entire project took 30 minutes and zero lines of new code.

The setup

Quick context: I'm building a RAG system evaluated against FinanceBench (Patronus AI), a benchmark with 150 expert-annotated Q&A pairs about SEC filings. The pipeline is GPT-4o-mini for generation, text-embedding-3-small for embeddings, and Qdrant as the vector store. The eval infrastructure uses an LLM-as-judge calibrated against human labels ([Post 1 covers the eval setup](https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030)).

After Phase 2, I had a baseline: Recall@6 of 0.83, and about 47 out of 100 queries answered correctly (verified by human labels).
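For readers who want to reproduce the retrieval metrics, here is a minimal sketch of how Recall@k and MRR could be computed. It assumes Recall@k is counted as a per-query hit rate (a query counts as a hit if at least one relevant chunk is in the top k), which matches the way misses are counted in this post but may differ from the repo's exact definition. The function names and the `runs` structure are hypothetical.

```python
def hit_at_k(retrieved_ids, relevant_ids, k=6):
    """Return True if any relevant id appears in the top-k retrieved ids."""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])


def recall_at_k(runs, k=6):
    """Fraction of queries with at least one relevant hit in the top k.

    `runs` is a list of (retrieved_ids, relevant_ids) pairs, one per query.
    """
    if not runs:
        return 0.0
    hits = sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in runs)
    return hits / len(runs)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant result (0 if none is retrieved)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```

Under this definition, 17 misses on a 100-query set correspond to a Recall@6 of 0.83.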

The 5 failure modes

I categorized every error in my 100-query eval set. Here's what I found:

1. Missing documents (the big one)

I was debugging why Johnson & Johnson queries always failed. 9 out of 17 retrieval misses were J&J documents. I assumed it was a semantic similarity problem since all SEC filings use nearly identical language.

It wasn't. The documents were never downloaded.

An audit revealed that 5 out of 84 documents in the dataset were missing from my vector store, and 2 more were corrupted during PDF extraction (AMD and KraftHeinz had 5 and 0 chunks respectively instead of 150+). After fixing this, retrieval misses dropped from 16 to 6.

| Metric | Before corpus fix | After corpus fix | Delta |
|---|---|---|---|
| Recall@6 | 0.840 | 0.940 | +0.100 |
| Retrieval misses | 16 | 6 | -10 |

This is the lesson I keep coming back to: I spent days implementing algorithmic improvements when the root cause was incomplete data. In production, this happens constantly: pipelines fail silently, documents get skipped, files get corrupted during processing. No amount of reranking or query rewriting fixes missing data.
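The audit itself can be very small. The sketch below is an illustration rather than the code from the repo: it assumes you can produce a list of document names the benchmark references and a mapping from document name to chunk count (for example from an ingestion log). The `min_chunks` threshold is a hypothetical value you would tune to your own chunking.

```python
def audit_corpus(expected_docs, chunk_counts, min_chunks=50):
    """Compare the dataset's document list against what was actually indexed.

    expected_docs: iterable of document names the benchmark references.
    chunk_counts:  dict mapping document name -> number of chunks indexed.
    min_chunks:    hypothetical threshold below which a document looks truncated.
    """
    missing, suspicious = [], []
    for doc in sorted(set(expected_docs)):
        if doc not in chunk_counts:
            # Never ingested at all.
            missing.append(doc)
        elif chunk_counts[doc] < min_chunks:
            # Ingested, but with far fewer chunks than a full filing should have.
            suspicious.append((doc, chunk_counts[doc]))
    return {"missing": missing, "suspicious": suspicious}
```

Running a check like this right after ingestion, and failing loudly when either list is non-empty, turns a silent pipeline failure into a visible one.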

2. Cross-document confusion within the same company

After ingesting the missing J&J documents, a new problem appeared. The retriever sometimes pulled chunks from J&J's 2022 10-K when the query was about J&J's 2023 8-K. Same company, wrong document.

My metadata filter extracted company name from the query and filtered Qdrant before retrieval. But filtering by "Johnson & Johnson" doesn't help when both the right and wrong documents are from Johnson & Johnson.

Fix: I extended the filter to extract company, year, and document type. This helped but didn't fully resolve it, since the language overlap between a company's own filings across years is even higher than between different companies.
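As a rough illustration of the extended filter, the sketch below pulls the three fields out of a query with simple string matching and shapes them as exact-match conditions. Everything here is an assumption: the payload keys (`company`, `year`, `doc_type`), the list of filing types, and the regex are hypothetical, and the output dict only mirrors the general shape of a Qdrant JSON filter as I understand it, so check it against the client version you use.

```python
import re

DOC_TYPES = ("10-K", "10-Q", "8-K")  # assumed set of filing types


def extract_filters(query, known_companies):
    """Pull company, year, and filing type out of a query string."""
    filters = {}
    lowered = query.lower()
    for company in known_companies:
        if company.lower() in lowered:
            filters["company"] = company
            break
    # Matches a standalone four-digit year; "FY2022" would not match.
    year = re.search(r"\b(19|20)\d{2}\b", query)
    if year:
        filters["year"] = int(year.group(0))
    for doc_type in DOC_TYPES:
        if doc_type.lower() in lowered:
            filters["doc_type"] = doc_type
            break
    return filters


def to_filter_conditions(filters):
    """Turn extracted fields into a list of exact-match conditions."""
    return {"must": [{"key": k, "match": {"value": v}} for k, v in filters.items()]}
```

A query that names only the company still yields a single-condition filter, which is exactly the case where same-company confusion remains possible.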

3. Numerical extraction errors

The model retrieves the right document, finds the right section, but produces the wrong number. Quick ratio of 1.76 when the correct answer is 1.57. Dividend payout ratio of 83.56% when it should be 80%.

These errors are invisible to a standard LLM judge because the answer is well-structured, uses correct methodology, and sounds right. More on this below.

4. Hybrid retrieval noise

I added BM25 alongside dense retrieval, fused via Reciprocal Rank Fusion. The theory: keyword matching catches exact terms that semantic search misses.

The result: Precision dropped from 0.422 to 0.405 and MRR dropped from 0.646 to 0.594. BM25 was pulling in keyword-matching chunks that weren't semantically relevant, pushing the correct chunks lower in the ranking. Not every theoretical improvement works in practice.
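To see how this can happen, here is a minimal sketch of Reciprocal Rank Fusion. It is not the code from the repo. The constant `k=60` is a commonly used default and an assumption here, and ties are broken alphabetically by id just to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

With a dense ranking of `["a", "b", "c"]` and a BM25 ranking of `["x", "a", "y"]`, the chunk `x` that only BM25 likes ends up ahead of `b`, the dense retriever's second choice. If `x` is a keyword match without real relevance, that is the kind of displacement that lowers MRR.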

5. Judge inflation

This one is subtle and easy to miss.

My LLM judge (A|GT, with ground truth) said 63 out of 100 answers were correct. When I checked against 30 human labels, the real number was about 47. The judge was inflating the count by 16 answers, or about 34% relative to the human estimate.

Why? The judge evaluated reasoning quality and methodology, not numerical accuracy. An answer that says "$1,608M" when the correct answer is "$2,018M" got a 5/5 because the explanation was well-structured.

I built a stricter judge (v2) with explicit numerical comparison rules. It brought the estimate down to 51/100, much closer to reality. But it overcorrected: TPR dropped from 0.93 to 0.71, meaning it now rejects 29% of correct answers.
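One way to make numerical rules explicit is a deterministic pre-check that runs alongside the LLM judge. The sketch below is an assumption about how such a check could look, not the v2 judge itself. The regex, the 1% relative tolerance, and both function names are hypothetical, and it does not normalize units, so "$1,608M" and "$1.608B" would not match.

```python
import math
import re

# Digits with optional thousands separators and an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*\.?\d*")


def extract_numbers(text):
    """Return every numeric literal in the text as a float, ignoring commas."""
    return [float(m.replace(",", "")) for m in _NUMBER.findall(text)]


def numerically_matches(answer, ground_truth, rel_tol=0.01):
    """True if every number in the ground truth appears in the answer within rel_tol."""
    expected = extract_numbers(ground_truth)
    found = extract_numbers(answer)
    if not expected:
        return True  # nothing numeric to check; defer to the LLM judge
    return all(
        any(math.isclose(value, target, rel_tol=rel_tol) for value in found)
        for target in expected
    )
```

The tolerance is the knob that trades false positives for false negatives: too tight and correctly rounded answers get rejected, too loose and 83.56% passes for 80%.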

| Judge | Reported correct | TPR | TNR | Inflation vs human |
|---|---|---|---|---|
| v1 (lenient) | 63/100 | 0.93 | 0.75 | +16 |
| v2 (strict) | 51/100 | 0.71 | 0.94 | +4 |
| Human | ~47/100 | | | |

There is no perfect judge. You choose your bias: false positives (approve wrong answers) or false negatives (reject correct answers). The only ground truth is human labels.
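The TPR and TNR figures come from comparing judge verdicts with human verdicts on the same answers. A minimal sketch of that comparison, assuming both sets of labels are stored as booleans (True means "correct") and using a hypothetical function name:

```python
def judge_rates(judge_labels, human_labels):
    """Compute TPR and TNR of a judge, treating human labels as ground truth.

    Both arguments are equal-length sequences of booleans (True = correct).
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(judge_labels, human_labels))
    tp = sum(j and h for j, h in pairs)              # judge and human both say correct
    tn = sum((not j) and (not h) for j, h in pairs)  # both say wrong
    positives = sum(h for _, h in pairs)
    negatives = len(pairs) - positives
    tpr = tp / positives if positives else float("nan")
    tnr = tn / negatives if negatives else float("nan")
    return {"tpr": tpr, "tnr": tnr}
```

A low TNR means the judge approves wrong answers (v1's problem); a low TPR means it rejects correct ones (v2's problem).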

What actually moved the needle

Here's the full progression:

| Phase | Recall@6 | Accuracy (human) | What changed |
|---|---|---|---|
| Phase 2 (baseline) | 0.830 | ~47/100 | Nothing, first measurement |
| Phase 3 (algorithmic fixes) | 0.840 | ~47/100 | Hybrid retrieval + metadata filtering |
| Phase 3b (corpus fix) | 0.940 | ~47/100 | Ingested 5 missing + 2 corrupted docs |

The uncomfortable finding: retrieval improved significantly (83% to 94%) but real accuracy stayed at ~47%. The bottleneck shifted from retrieval to generation. The model now gets the right document but still produces wrong numbers.

What I didn't implement (and why)

| Approach | Why I skipped it |
|---|---|
| Semantic chunking | Requires full re-ingestion, uncertain impact vs current chunking |
| Reranker (Cohere/cross-encoder) | Adds latency and cost, lower priority than data quality fixes |
| Query rewriting / HyDE | Adds LLM call per query, cross-company noise likely persists |
| Contextual retrieval | High potential but requires full re-ingestion pipeline |

These aren't bad ideas. They're deferred decisions. The point is knowing when to stop iterating on retrieval and start looking at generation quality, which is where the bottleneck is now.

What I learned

Data quality beats algorithms. My most sophisticated fix (hybrid retrieval with RRF) had negative impact on two metrics. My simplest fix (downloading missing files) had the largest positive impact of the entire project.

Your eval is only as good as your judge. Three different evaluation approaches gave me three different accuracy numbers: 63%, 51%, and 47%. Without human calibration, I would have reported 63% and believed the pipeline was improving when it wasn't.

Know when the bottleneck shifts. I could keep optimizing retrieval, but with Recall at 94%, the remaining errors are in generation. The next improvements need to target how the model extracts and reasons about numbers, not how it finds documents.

Repo: financebench-rag-eval
