My RAG system for financial document Q&A was stuck at 53% accuracy. I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Accuracy went to 58%.
Then I ran a corpus audit and found that 5 documents were never ingested and 2 were corrupted. Fixing that alone pushed recall from 84% to 94%.
The most impactful improvement in the entire project took 30 minutes and zero lines of new code.
The setup
Quick context: I'm building a RAG system evaluated against FinanceBench (Patronus AI), a benchmark with 150 expert-annotated Q&A pairs about SEC filings. The pipeline uses GPT-4o-mini for generation, text-embedding-3-small for embeddings, and Qdrant as the vector store. The eval infrastructure uses an LLM-as-judge calibrated against human labels ([Post 1 covers the eval setup](https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030)).
After Phase 2, I had a baseline: Recall@6 of 0.83, and about 47 out of 100 queries answered correctly (verified by human labels).
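For readers who want to reproduce the metric, here is a minimal sketch of Recall@k. It assumes the metric is a per-query hit rate (the gold document either appears in the top k results or it doesn't), which is consistent with the miss counts reported in this post. The function name, the dictionary shapes, and the toy document IDs are all illustrative, not taken from the repo.

```python
from typing import Dict, List


def recall_at_k(retrieved: Dict[str, List[str]], gold: Dict[str, str], k: int = 6) -> float:
    """Fraction of queries whose gold document appears in the top-k results."""
    if not gold:
        raise ValueError("gold must contain at least one query")
    hits = sum(
        1 for query_id, doc_id in gold.items()
        if doc_id in retrieved.get(query_id, [])[:k]
    )
    return hits / len(gold)


# Toy data (hypothetical IDs): q1 and q3 hit; q2's gold document is ranked 7th,
# so it falls outside the top 6 and counts as a retrieval miss.
retrieved = {
    "q1": ["docA", "docB"],
    "q2": ["d1", "d2", "d3", "d4", "d5", "d6", "docC"],
    "q3": ["docD"],
}
gold = {"q1": "docA", "q2": "docC", "q3": "docD"}
score = recall_at_k(retrieved, gold, k=6)
```

With 100 queries, every retrieval miss costs exactly 0.01 of Recall@6 under this definition, so 0.83 corresponds to 17 misses.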
The 5 failure modes
I categorized every error in my 100-query eval set. Here's what I found:
1. Missing documents (the big one)
I was debugging why Johnson & Johnson queries always failed. 9 out of 17 retrieval misses were J&J documents. I assumed it was a semantic similarity problem since all SEC filings use nearly identical language.
It wasn't. The documents were never downloaded.
An audit revealed that 5 out of 84 documents in the dataset were missing from my vector store, and 2 more were corrupted during PDF extraction (AMD and KraftHeinz had 5 and 0 chunks respectively instead of 150+). After fixing this, retrieval misses dropped from 16 to 6.
| Metric | Before corpus fix | After corpus fix | Delta |
|---|---|---|---|
| Recall@6 | 0.840 | 0.940 | +0.100 |
| Retrieval misses | 16 | 6 | -10 |
This is the lesson I keep coming back to: I spent days implementing algorithmic improvements when the root cause was incomplete data. In production, this happens constantly: pipelines fail silently, documents get skipped, files corrupt during processing. No amount of reranking or query rewriting fixes missing data.
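An audit like this can be very small. The sketch below is one way to do it, not the script from the repo: it compares the list of documents the dataset expects against per-document chunk counts pulled from the vector store. How you obtain `chunk_counts` depends on your store; the `min_chunks` threshold and the document names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def audit_corpus(
    expected_docs: Iterable[str],
    chunk_counts: Dict[str, int],
    min_chunks: int = 50,
) -> Tuple[List[str], List[str]]:
    """Return (missing, suspicious) document names.

    missing: expected documents with zero chunks in the store.
    suspicious: documents present but with fewer than min_chunks chunks,
    which often points to a failed or partial PDF extraction.
    """
    missing: List[str] = []
    suspicious: List[str] = []
    for doc in expected_docs:
        count = chunk_counts.get(doc, 0)
        if count == 0:
            missing.append(doc)
        elif count < min_chunks:
            suspicious.append(doc)
    return sorted(missing), sorted(suspicious)


# Hypothetical document names; the 5-chunk and 0-chunk cases mirror the
# AMD and KraftHeinz symptoms described above.
expected = ["AMD_10K", "JNJ_10K", "KRAFTHEINZ_10K", "PEPSICO_10K"]
chunk_counts = {"AMD_10K": 5, "KRAFTHEINZ_10K": 0, "PEPSICO_10K": 212}
missing, suspicious = audit_corpus(expected, chunk_counts)
```

Running a check like this at the end of every ingestion job, and failing the job when either list is non-empty, turns a silent failure into a loud one.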
2. Cross-document confusion within the same company
After ingesting the missing J&J documents, a new problem appeared. The retriever sometimes pulled chunks from J&J's 2022 10-K when the query was about J&J's 2023 8-K. Same company, wrong document.
My metadata filter extracted company name from the query and filtered Qdrant before retrieval. But filtering by "Johnson & Johnson" doesn't help when both the right and wrong documents are from Johnson & Johnson.
Fix: extended the filter to extract company + year + document type. This helped but didn't fully resolve it since the language overlap between a company's own filings across years is even higher than between different companies.
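As an illustration of the extended filter, here is a rough sketch that pulls company, year, and document type out of a query with simple string matching and regular expressions. The company list, the payload keys (`company`, `year`, `doc_type`), and the helper names are assumptions for this example. The condition dictionaries follow the key/match/value shape used in Qdrant's filter JSON, but check the shape against your client version before relying on it.

```python
import re
from typing import Any, Dict, List, Optional

# Hypothetical lookup list; a real system would load this from the corpus index.
KNOWN_COMPANIES = ["Johnson & Johnson", "AMD", "Kraft Heinz"]
DOC_TYPE_RE = re.compile(r"\b(10-K|10-Q|8-K)\b", re.IGNORECASE)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_filters(query: str) -> Dict[str, Optional[Any]]:
    """Extract company, year, and document type; None when a field is absent."""
    lowered = query.lower()
    company = next((c for c in KNOWN_COMPANIES if c.lower() in lowered), None)
    year_match = YEAR_RE.search(query)
    type_match = DOC_TYPE_RE.search(query)
    return {
        "company": company,
        "year": int(year_match.group()) if year_match else None,
        "doc_type": type_match.group(1).upper() if type_match else None,
    }


def to_filter_conditions(filters: Dict[str, Optional[Any]]) -> List[Dict[str, Any]]:
    """Build must-match conditions, skipping fields the query did not mention."""
    return [
        {"key": key, "match": {"value": value}}
        for key, value in filters.items()
        if value is not None
    ]


filters = extract_filters("What did Johnson & Johnson report in its 2023 8-K?")
conditions = to_filter_conditions(filters)
```

Skipping absent fields matters: a query that names no year should fall back to the company-only filter rather than return nothing.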
3. Numerical extraction errors
The model retrieves the right document, finds the right section, but produces the wrong number. Quick ratio of 1.76 when the correct answer is 1.57. Dividend payout ratio of 83.56% when it should be 80%.
These errors are invisible to a standard LLM judge because the answer is well-structured, uses correct methodology, and sounds right. More on this below.
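One way to make these errors visible is a deterministic numeric check that runs next to the LLM judge. The sketch below is a swapped-in technique for illustration, not the judge used in this project: it extracts numbers from the answer and the ground truth and requires every ground-truth number to appear in the answer within a relative tolerance. The 1% tolerance is an assumption, and the check ignores units, so "$2,018M" and "$2.018B" would not match.

```python
import re
from typing import List

# Matches integers and decimals with optional thousands separators, e.g. 1,608 or 83.56
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> List[float]:
    """Return every number found in text, with thousands separators removed."""
    return [float(m.replace(",", "")) for m in NUMBER_RE.findall(text)]


def numerically_matches(answer: str, truth: str, rel_tol: float = 0.01) -> bool:
    """True if every number in truth appears in answer within rel_tol."""
    truth_nums = extract_numbers(truth)
    answer_nums = extract_numbers(answer)
    if not truth_nums:
        return True  # nothing numeric to verify
    return all(
        any(abs(a - t) <= rel_tol * max(abs(t), 1e-12) for a in answer_nums)
        for t in truth_nums
    )


quick_ratio_ok = numerically_matches("The quick ratio is 1.76.", "1.57")
payout_ok = numerically_matches("The dividend payout ratio is 83.56%.", "80%")
exact_ok = numerically_matches("Revenue was $2,018M.", "$2,018M")
```

Both examples from this section fail the check, no matter how well the surrounding explanation is written.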
4. Hybrid retrieval noise
I added BM25 alongside dense retrieval, fused via Reciprocal Rank Fusion. The theory: keyword matching catches exact terms that semantic search misses.
The result: Precision dropped from 0.422 to 0.405 and MRR dropped from 0.646 to 0.594. BM25 was pulling in keyword-matching chunks that weren't semantically relevant, pushing the correct chunks lower in the ranking. Not every theoretical improvement works in practice.
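To see how fusion can demote a correct chunk, here is a minimal Reciprocal Rank Fusion sketch. Each ranked list contributes `1 / (k + rank)` to a chunk's score. The constant `k = 60` is the commonly used default and an assumption here, as are the chunk IDs; the project's actual settings may differ.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank) per item, best score first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy case: dense retrieval ranks the correct chunk ("gold") first,
# but BM25 does not return it at all.
dense = ["gold", "a", "b"]
bm25 = ["a", "b", "x"]
fused = reciprocal_rank_fusion([dense, bm25])
```

In this toy case the correct chunk falls from rank 1 to rank 3, because chunks that appear in both lists collect two contributions while it collects only one. That is the same mechanism that lowers MRR when BM25 returns keyword matches that are not actually relevant.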
5. Judge inflation
This one is subtle and easy to miss.
My LLM judge (A|GT, with ground truth) said 63 out of 100 answers were correct. When I checked against 30 human labels, the real number was about 47. The judge was overcounting by 16 answers, about 34% above the human-verified figure.
Why? The judge evaluated reasoning quality and methodology, not numerical accuracy. An answer that says "$1,608M" when the correct answer is "$2,018M" got a 5/5 because the explanation was well-structured.
I built a stricter judge (v2) with explicit numerical comparison rules. It brought the estimate down to 51/100, much closer to reality. But it overcorrected: TPR dropped from 0.93 to 0.71, meaning it now rejects 29% of correct answers.
| Judge | Reported correct | TPR | TNR | Inflation vs human |
|---|---|---|---|---|
| v1 (lenient) | 63/100 | 0.93 | 0.75 | +16 |
| v2 (strict) | 51/100 | 0.71 | 0.94 | +4 |
| Human | ~47/100 | — | — | — |
There is no perfect judge. You choose your bias: false positives (approve wrong answers) or false negatives (reject correct answers). The only ground truth is human labels.
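The TPR and TNR figures come from comparing judge verdicts with human labels on the same answers. A minimal sketch of that comparison, with made-up verdicts and an illustrative function name:

```python
from typing import List, Tuple


def judge_rates(judge: List[bool], human: List[bool]) -> Tuple[float, float]:
    """Return (TPR, TNR) of judge verdicts measured against human labels.

    TPR: share of human-correct answers the judge also marks correct.
    TNR: share of human-incorrect answers the judge also marks incorrect.
    """
    if len(judge) != len(human):
        raise ValueError("judge and human must have the same length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect human label")
    true_pos = sum(1 for j, h in zip(judge, human) if j and h)
    true_neg = sum(1 for j, h in zip(judge, human) if not j and not h)
    return true_pos / positives, true_neg / negatives


# Toy labels (not project data): the judge rejects one correct answer
# and approves one wrong answer.
human_labels = [True, True, True, True, False, False, False, False]
judge_labels = [True, True, True, False, True, False, False, False]
tpr, tnr = judge_rates(judge_labels, human_labels)
```

A low TNR means wrong answers slip through and inflate the reported accuracy; a low TPR means correct answers get rejected and the report understates it.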
What actually moved the needle
Here's the full progression:
| Phase | Recall@6 | Accuracy (human) | What changed |
|---|---|---|---|
| Phase 2 (baseline) | 0.830 | ~47/100 | Nothing, first measurement |
| Phase 3 (algorithmic fixes) | 0.840 | ~47/100 | Hybrid retrieval + metadata filtering |
| Phase 3b (corpus fix) | 0.940 | ~47/100 | Ingested 5 missing + 2 corrupted docs |
The uncomfortable finding: retrieval improved significantly (83% to 94%) but real accuracy stayed at ~47%. The bottleneck shifted from retrieval to generation. The model now gets the right document but still produces wrong numbers.
What I didn't implement (and why)
| Approach | Why I skipped it |
|---|---|
| Semantic chunking | Requires full re-ingestion, uncertain impact vs current chunking |
| Reranker (Cohere/cross-encoder) | Adds latency and cost, lower priority than data quality fixes |
| Query rewriting / HyDE | Adds LLM call per query, cross-company noise likely persists |
| Contextual retrieval | High potential but requires full re-ingestion pipeline |
These aren't bad ideas. They're deferred decisions. The point is knowing when to stop iterating on retrieval and start looking at generation quality, which is where the bottleneck is now.
What I learned
Data quality beats algorithms. My most sophisticated fix (hybrid retrieval with RRF) had negative impact on two metrics. My simplest fix (downloading missing files) had the largest positive impact of the entire project.
Your eval is only as good as your judge. Three different evaluation approaches gave me three different accuracy numbers: 63%, 51%, and 47%. Without human calibration, I would have reported 63% and believed the pipeline was improving when it wasn't.
Know when the bottleneck shifts. I could keep optimizing retrieval, but with Recall at 94%, the remaining errors are in generation. The next improvements need to target how the model extracts and reasons about numbers, not how it finds documents.
Repo: financebench-rag-eval
References
- FinanceBench — Patronus AI
- 6 RAG Evals — Jason Liu
- LLM Evals FAQ — Hamel Husain
- Arize Phoenix
- AI Builder's Handbook — LevelUp Labs