DEV Community: João Paulo Traguetta Rufino

From 10% to 57% Accuracy on FinanceBench: What Actually Moved the Needle

João Paulo Traguetta Rufino — Thu, 04 Jun 2026 19:00:49 +0000

A month ago I started building a RAG system for financial document Q&A. First test: 2 out of 20 questions correct. Last test: 57% accuracy on 100 queries, validated against human labels.

This post is about which improvements actually worked, which didn't, and the one finding that surprised me most.

The setup

The system answers questions about 84 SEC filings (10-K, 10-Q, earnings reports) from public companies, evaluated against FinanceBench by Patronus AI, which provides 150 expert-annotated Q&A pairs with ground truth answers.

Final stack: GPT-4o for generation, text-embedding-3-small for embeddings, Qdrant for vector storage (hybrid dense + BM25), LangGraph for orchestration (CRAG pipeline with document grading), BAAI/bge-reranker-base for reranking, and contextual retrieval with metadata prefixes on every chunk.

Full repo: financebench-rag-eval

The progression

Phase	Recall@6	Accuracy (human)	What changed
Baseline	—	10% (20 queries)	First test, vanilla RAG
Phase 2	0.830	~47% (100 queries)	Eval infrastructure built
Phase 3b	0.940	~47%	Corpus fix + metadata filter + hybrid
Phase 4	0.950	~57%	CRAG pipeline + rerank + contextual retrieval + GPT-4o

Two things stand out. Retrieval went from 83% to 95% but accuracy stayed at 47%. Then I changed the generation model and accuracy jumped to 57%. More on that below.

What actually worked

1. Corpus audit (+10pp recall, zero code change)

I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Recall went from 83% to 84%. Then I ran an audit and found that 5 documents were never ingested and 2 were corrupted during PDF extraction.

Fixing that took 30 minutes. Recall jumped to 94%.

9 out of 17 retrieval misses were from Johnson & Johnson documents that simply weren't in the vector store. The pipeline gave no error. It just retrieved chunks from other companies and generated a confident wrong answer.

Lesson: before you optimize retrieval, verify your data is actually all there.
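A minimal sketch of such an audit, assuming you can list the source document of every indexed chunk. The function name, the `min_chunks` threshold, and the document names are hypothetical and not taken from the repo.

```python
from collections import Counter

def audit_corpus(expected_docs, chunk_sources, min_chunks=50):
    """Compare the expected document list against what is actually indexed.

    expected_docs: document names that should be in the vector store.
    chunk_sources: one document name per indexed chunk.
    min_chunks: hypothetical threshold below which a document looks corrupted.
    """
    counts = Counter(chunk_sources)
    missing = sorted(d for d in expected_docs if counts[d] == 0)
    suspicious = sorted(d for d in expected_docs if 0 < counts[d] < min_chunks)
    return {"missing": missing, "suspicious": suspicious}

# Toy data: DOC_B was truncated during extraction, DOC_C was never ingested.
expected = ["DOC_A", "DOC_B", "DOC_C"]
sources = ["DOC_A"] * 160 + ["DOC_B"] * 5
report = audit_corpus(expected, sources)
print(report)
```

Running a check like this after every ingestion turns a silent failure into a visible one.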

2. CRAG pipeline (replaced agent loop)

The original pipeline was a LangGraph agent that decided when to retrieve and when to answer. Sometimes it made 5-6 retrieval calls, pulling in noise from unrelated companies.

I replaced it with an explicit graph: query_analysis → retrieve → rerank → grade_documents → generate. If the grading step says the chunks are irrelevant, it relaxes the metadata filter and retries once.

This made the pipeline predictable, cheaper (fewer API calls), and easier to debug. Every step has a fixed role instead of the LLM deciding the flow.
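The fixed flow can be sketched in plain Python. This is a stand-in for illustration, not the LangGraph graph from the repo: every step is passed in as a hypothetical callable, and "relaxing" the filter is assumed here to mean dropping it entirely.

```python
from typing import Callable, Dict, List, Optional

def run_pipeline(
    question: str,
    analyze: Callable[[str], Dict[str, str]],
    retrieve: Callable[[str, Optional[Dict[str, str]]], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    grade: Callable[[str, List[str]], bool],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Fixed flow: analyze -> retrieve -> rerank -> grade -> generate."""
    filters = analyze(question)
    chunks = rerank(question, retrieve(question, filters))
    if not grade(question, chunks):
        # One retry only, with the metadata filter relaxed (assumed: removed).
        chunks = rerank(question, retrieve(question, None))
    return generate(question, chunks)

# Stub steps so the control flow can be exercised without any API calls.
calls = []

def fake_retrieve(question, filters):
    calls.append(filters)
    return ["on-topic chunk"] if filters is None else ["off-topic chunk"]

answer = run_pipeline(
    "What was FY2022 revenue?",
    analyze=lambda q: {"company": "ExampleCo"},
    retrieve=fake_retrieve,
    rerank=lambda q, chunks: chunks,
    grade=lambda q, chunks: "on-topic chunk" in chunks,
    generate=lambda q, chunks: f"answer from {chunks[0]}",
)
print(answer, calls)
```

Because the retry happens at most once, the number of retrieval calls per query is bounded at two.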

3. Contextual retrieval prefixes

SEC filings use nearly identical language across companies. "Net revenues increased" appears in every 10-K. So I prepended each chunk with metadata before embedding:

Company: Johnson & Johnson | Document: 10K | Year: 2022

This changes the embedding to capture where the chunk comes from, not just what it says. Combined with metadata filtering at query time, it reduced cross-company retrieval errors.
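As a rough sketch, the prefixing step could look like the following. The function name, the blank-line separator, and the sample chunk text are assumptions; only the prefix format comes from the example above.

```python
def add_context_prefix(chunk_text, company, doc_type, year):
    """Prepend document metadata so it becomes part of the embedded text."""
    prefix = f"Company: {company} | Document: {doc_type} | Year: {year}"
    # The blank-line separator is an assumption; any consistent separator works.
    return f"{prefix}\n\n{chunk_text}"

text_to_embed = add_context_prefix(
    "Net revenues increased ...",  # placeholder chunk text
    company="Johnson & Johnson",
    doc_type="10K",
    year=2022,
)
print(text_to_embed)
```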

4. Switching from GPT-4o-mini to GPT-4o (+10pp accuracy)

This was the biggest finding of the project.

After all the retrieval improvements, accuracy was stuck at ~47%. Recall was at 95%. The pipeline was retrieving the right documents but the model was extracting wrong numbers or saying "I don't know" when the answer was right there in the context.

I switched generation from GPT-4o-mini to GPT-4o. Accuracy went from ~47% to ~57%. Same retrieval, same chunks, same prompts. Just a better model.

With recall at 95%, the bottleneck was no longer retrieval. It was the generation model's ability to reason about financial data.

What didn't work

Hybrid retrieval (dense + BM25). Added BM25 via FastEmbedSparse with RRF fusion. Faithfulness improved (+0.78) because BM25 catches exact number matches, but Precision and MRR dropped. BM25 pulled in keyword-matching chunks that weren't semantically relevant, pushing correct chunks lower in the ranking.

Judge v1 without calibration. My LLM judge said 63 out of 100 answers were correct. When I checked against 30 human labels, the real number was about 47. The judge overstated accuracy by 16 points, a 34% relative inflation, because it evaluated fluency, not numerical accuracy. An answer saying "$1,608M" when the correct answer was "$2,018M" got 5/5 because it was well-structured.

I built a stricter judge (v2) with explicit numerical comparison rules. TNR improved from 0.75 to 0.94.
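To make "numerical comparison" concrete, here is a deterministic sketch of the kind of check such a rule encodes. It is not the judge prompt from the repo: the parser, the K/M/B suffix handling, and the 1% relative tolerance are all assumptions for illustration.

```python
import re

_SCALE = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_amount(text):
    """Parse strings like '$1,608M' or '1.57' into a float, or None if no number."""
    match = re.search(r"(-?\d[\d,]*\.?\d*)\s*([KMB])?", text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    return value * _SCALE.get(match.group(2), 1.0)

def numbers_match(predicted, expected, rel_tol=0.01):
    """True when both values parse and differ by at most rel_tol (assumed 1%)."""
    p, e = parse_amount(predicted), parse_amount(expected)
    if p is None or e is None:
        return False
    if e == 0:
        return p == 0
    return abs(p - e) / abs(e) <= rel_tol

print(numbers_match("$1,608M", "$2,018M"))  # fluent but wrong number
print(numbers_match("80.4%", "80%"))        # within tolerance
```

A check like this only covers answers that reduce to a single number; free-text answers still need the LLM judge.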

The eval system

Every number in this post comes from a multi-tier eval:

Tier 1 (retrieval): Recall@6, Precision@6, MRR. Measured separately from generation so I could tell where the pipeline was failing.

Tier 2 (generation): LLM-as-judge scoring context relevance, faithfulness, and answer correctness against ground truth. Two judge versions: v1 (lenient, fluency-biased) and v2 (strict, numerical tolerance enforced).

Calibration: Every judge validated against 30 human labels. TPR and TNR reported. Final calibration: TPR=0.82, TNR=0.92. Without this step, I would have reported 63% accuracy instead of the real 47%.

Cost

Metric	Value
Cost per query	$0.017
Average latency	40.7s
Tokens per query	~6,900
Total eval cost (100 queries)	~$1.74

What I'd do differently

Start with a corpus audit before any algorithmic work. I could have saved two weeks.

Build the eval infrastructure in week 1, not week 4. Without measurement, I was guessing. With measurement, every change had a clear before/after.

Test the generation model earlier. I assumed GPT-4o-mini was "good enough" and spent weeks optimizing retrieval. The model swap should have been the first experiment, not the last.

What's next

The 57% accuracy is competitive for RAG on FinanceBench (GPT-4 with full document context scores ~60-65% on this benchmark). But there's room to improve: better table extraction from PDFs, larger chunk sizes to preserve financial tables, and multi-step reasoning for complex calculations.

These are documented as future work in the repo.

Repo: financebench-rag-eval

References

FinanceBench — Patronus AI
6 RAG Evals — Jason Liu
LLM Evals FAQ — Hamel Husain
AI Builder's Handbook — LevelUp Labs
LangGraph docs

5 Failure Modes I Found in My Financial RAG (And the One That Actually Mattered)

João Paulo Traguetta Rufino — Sat, 30 May 2026 14:51:24 +0000

My RAG system for financial document Q&A was stuck at 53% accuracy. I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Accuracy went to 58%.

Then I ran a corpus audit and found that 5 documents were never ingested and 2 were corrupted. Fixing that alone pushed recall from 84% to 94%.

The most impactful improvement in the entire project took 30 minutes and zero lines of new code.

The setup

Quick context: I'm building a RAG system evaluated against FinanceBench (Patronus AI), a benchmark with 150 expert-annotated Q&A pairs about SEC filings. The pipeline is GPT-4o-mini for generation, text-embedding-3-small for embeddings, and Qdrant as the vector store. Full eval infrastructure with LLM-as-judge calibrated against human labels (Post 1 covers the eval setup: https://dev.to/joaopaulotr/building-an-evaluation-harness-for-financial-rag-what-i-learned-about-llm-as-judge-calibration-5030).

After Phase 2, I had a baseline: Recall@6 of 0.83, and about 47 out of 100 queries answered correctly (verified by human labels).

The 5 failure modes

I categorized every error in my 100-query eval set. Here's what I found:

1. Missing documents (the big one)

I was debugging why Johnson & Johnson queries always failed. 9 out of 17 retrieval misses were J&J documents. I assumed it was a semantic similarity problem since all SEC filings use nearly identical language.

It wasn't. The documents were never downloaded.

An audit revealed that 5 out of 84 documents in the dataset were missing from my vector store, and 2 more were corrupted during PDF extraction (AMD and KraftHeinz had 5 and 0 chunks respectively instead of 150+). After fixing this, retrieval misses dropped from 16 to 6.

Metric	Before corpus fix	After corpus fix	Delta
Recall@6	0.840	0.940	+0.100
Retrieval misses	16	6	-10

This is the lesson I keep coming back to: I spent days implementing algorithmic improvements when the root cause was incomplete data. In production, this happens constantly. Pipelines that fail silently, documents that get skipped, files that corrupt during processing. No amount of reranking or query rewriting fixes missing data.

2. Cross-document confusion within the same company

After ingesting the missing J&J documents, a new problem appeared. The retriever sometimes pulled chunks from J&J's 2022 10-K when the query was about J&J's 2023 8-K. Same company, wrong document.

My metadata filter extracted company name from the query and filtered Qdrant before retrieval. But filtering by "Johnson & Johnson" doesn't help when both the right and wrong documents are from Johnson & Johnson.

Fix: extended the filter to extract company + year + document type. This helped but didn't fully resolve it since the language overlap between a company's own filings across years is even higher than between different companies.
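A sketch of what the extended filter might look like as a Qdrant-style JSON filter body (`must` conditions with `key` and `match.value`). The payload keys `company`, `year`, and `doc_type` are hypothetical, as is the helper itself.

```python
def build_filter(company=None, year=None, doc_type=None):
    """Build a Qdrant-style filter body from whichever fields were extracted."""
    fields = {"company": company, "year": year, "doc_type": doc_type}
    must = [
        {"key": key, "match": {"value": value}}
        for key, value in fields.items()
        if value is not None  # skip fields the query analysis could not extract
    ]
    return {"must": must}

company_only = build_filter(company="Johnson & Johnson")
full = build_filter(company="Johnson & Johnson", year=2023, doc_type="8K")
print(full)
```

With only the company condition, both the 2022 10-K and the 2023 8-K pass the filter. The two extra conditions are what separate them.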

3. Numerical extraction errors

The model retrieves the right document, finds the right section, but produces the wrong number. Quick ratio of 1.76 when the correct answer is 1.57. Dividend payout ratio of 83.56% when it should be 80%.

These errors are invisible to a standard LLM judge because the answer is well-structured, uses correct methodology, and sounds right. More on this below.

4. Hybrid retrieval noise

I added BM25 alongside dense retrieval, fused via Reciprocal Rank Fusion. The theory: keyword matching catches exact terms that semantic search misses.

The result: Precision dropped from 0.422 to 0.405 and MRR dropped from 0.646 to 0.594. BM25 was pulling in keyword-matching chunks that weren't semantically relevant, pushing the correct chunks lower in the ranking. Not every theoretical improvement works in practice.
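For reference, Reciprocal Rank Fusion scores each chunk as the sum of 1 / (k + rank) over the ranked lists it appears in. The sketch below uses k = 60, a common default that may differ from the setting in this pipeline, and toy chunk ids chosen to show how a correct chunk can slide down after fusion.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum of 1 / (k + rank), ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Toy rankings: dense search puts the correct chunk first,
# BM25 does not return it at all.
dense = ["correct", "a", "b"]
bm25 = ["a", "b", "noise"]
fused = reciprocal_rank_fusion([dense, bm25])
print(fused)
```

In this toy case the chunks that appear in both lists overtake the one that only dense search found, so the correct chunk moves from rank 1 to rank 3. That is the same mechanism that lowers MRR.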

5. Judge inflation

This one is subtle and easy to miss.

My LLM judge (A|GT, with ground truth) said 63 out of 100 answers were correct. When I checked against 30 human labels, the real number was about 47. The judge was overstating accuracy by 16 points, or 34% in relative terms.

Why? The judge evaluated reasoning quality and methodology, not numerical accuracy. An answer that says "$1,608M" when the correct answer is "$2,018M" got a 5/5 because the explanation was well-structured.

I built a stricter judge (v2) with explicit numerical comparison rules. It brought the estimate down to 51/100, much closer to reality. But it overcorrected: TPR dropped from 0.93 to 0.71, meaning it now rejects 29% of correct answers.

Judge	Reported correct	TPR	TNR	Inflation vs human
v1 (lenient)	63/100	0.93	0.75	+16
v2 (strict)	51/100	0.71	0.94	+4
Human	~47/100	—	—	—

There is no perfect judge. You choose your bias: false positives (approve wrong answers) or false negatives (reject correct answers). The only ground truth is human labels.
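The two ways of stating inflation are easy to mix up, so here is the arithmetic spelled out. The helper name is hypothetical, and the human figure of 47 is the approximate value from the table.

```python
def inflation(judge_correct, human_correct):
    """Return (absolute gap in answers, relative overstatement in percent)."""
    absolute = judge_correct - human_correct
    relative_pct = 100.0 * absolute / human_correct
    return absolute, round(relative_pct, 1)

v1 = inflation(63, 47)  # lenient judge vs human labels
v2 = inflation(51, 47)  # strict judge vs human labels
print(v1, v2)
```

The "+16" and "+4" in the table are the absolute gaps. The 34% figure is the relative overstatement for v1 (16 / 47).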

What actually moved the needle

Here's the full progression:

Phase	Recall@6	Accuracy (human)	What changed
Phase 2 (baseline)	0.830	~47/100	Nothing, first measurement
Phase 3 (algorithmic fixes)	0.840	~47/100	Hybrid retrieval + metadata filtering
Phase 3b (corpus fix)	0.940	~47/100	Ingested 5 missing + 2 corrupted docs

The uncomfortable finding: retrieval improved significantly (83% to 94%) but real accuracy stayed at ~47%. The bottleneck shifted from retrieval to generation. The model now gets the right document but still produces wrong numbers.

What I didn't implement (and why)

Approach	Why I skipped it
Semantic chunking	Requires full re-ingestion, uncertain impact vs current chunking
Reranker (Cohere/cross-encoder)	Adds latency and cost, lower priority than data quality fixes
Query rewriting / HyDE	Adds LLM call per query, cross-company noise likely persists
Contextual retrieval	High potential but requires full re-ingestion pipeline

These aren't bad ideas. They're deferred decisions. The point is knowing when to stop iterating on retrieval and start looking at generation quality, which is where the bottleneck is now.

What I learned

Data quality beats algorithms. My most sophisticated fix (hybrid retrieval with RRF) had negative impact on two metrics. My simplest fix (downloading missing files) had the largest positive impact of the entire project.

Your eval is only as good as your judge. Three different evaluation approaches gave me three different accuracy numbers: 63%, 51%, and 47%. Without human calibration, I would have reported 63% and believed the pipeline was improving when it wasn't.

Know when the bottleneck shifts. I could keep optimizing retrieval, but with Recall at 94%, the remaining errors are in generation. The next improvements need to target how the model extracts and reasons about numbers, not how it finds documents.

Repo: financebench-rag-eval

References

FinanceBench — Patronus AI
6 RAG Evals — Jason Liu
LLM Evals FAQ — Hamel Husain
Arize Phoenix
AI Builder's Handbook — LevelUp Labs

Building an Evaluation Harness for Financial RAG: What I Learned About LLM-as-Judge Calibration

João Paulo Traguetta Rufino — Tue, 19 May 2026 22:12:31 +0000

I built a RAG system for financial document Q&A. It answers questions about SEC filings (revenue, margins, debt ratios) using 84 public company documents from the FinanceBench benchmark.

After running 100 queries, my LLM judge said 74% of answers were correct. The actual number was 27%.

This post is about how I found that gap, why it exists, and what I did about it.

The setup

The pipeline is straightforward: embed 84 SEC filings (10-K, 10-Q, earnings reports) into Qdrant with text-embedding-3-small, retrieve top-6 chunks per query, generate answers with GPT-4o-mini.

FinanceBench gives you 150 expert-annotated Q&A pairs with ground truth answers and source documents. I used 100 of them as my eval set.

I measured quality in two tiers:

Tier 1 — Retrieval. Did the system find the right document? I tracked Recall@6, Precision@6, and MRR.

Tier 2 — Generation. Is the answer any good? I used an LLM judge (GPT-4o-mini scoring 1-5) to evaluate Context Relevance, Answer Faithfulness, and Answer Relevance.

Retrieval: decent but not great

Metric	Value
Recall@6	0.830
Precision@6	0.422
MRR	0.646

83 out of 100 queries retrieved the correct source document. Not bad for vanilla semantic search with zero filtering.
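One way to compute the three retrieval metrics per query is sketched below. It assumes a query counts as a hit when any top-6 chunk comes from the correct source document, which may not match the repo's exact definitions. The document ids are toy values.

```python
def retrieval_metrics(retrieved_docs, correct_doc, k=6):
    """retrieved_docs: source document of each retrieved chunk, best first."""
    top_k = retrieved_docs[:k]
    hits = [doc == correct_doc for doc in top_k]
    recall = 1.0 if any(hits) else 0.0   # did any chunk come from the right doc?
    precision = sum(hits) / k            # share of the top-k from the right doc
    first_hit = next((i for i, hit in enumerate(hits, start=1) if hit), None)
    reciprocal_rank = 1.0 / first_hit if first_hit else 0.0
    return recall, precision, reciprocal_rank

# Toy query: the first correct chunk shows up at rank 2.
per_query = retrieval_metrics(["X", "GOOD", "GOOD", "Y", "Z", "GOOD"], "GOOD")
miss = retrieval_metrics(["X", "Y", "Z", "X", "Y", "Z"], "GOOD")
print(per_query, miss)
```

Averaging each value over all queries gives Recall@6, Precision@6, and MRR.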

The 17 misses were concentrated: Johnson & Johnson (9 misses across different doc types) and Adobe (5 misses). Together, 14 out of 17 failures came from just two companies.

Why? SEC filings use nearly identical language across companies. "Net revenues increased," "operating income was impacted by" — these phrases appear in every single 10-K. Embeddings can't reliably tell 3M's filing from Coca-Cola's when the language is this similar.

I confirmed metadata filtering fixes this. When I manually filtered Qdrant to only return chunks from the correct PDF, retrieval hit 100%. Automatic filtering (LLM extracts company from query, filters before retrieval) is the planned fix.

The judge lies

Here's where things got interesting.

Metric	Avg Score (1-5)
Context Relevance (C|Q)	3.04
Answer Faithfulness (A|C)	3.36
Answer Relevance (A|Q)	3.96

The Answer Relevance judge classified 74 out of 100 answers as correct (score >= 4).

That felt too good for a system I knew was struggling. So I calibrated.

Calibration: the part nobody does

I took 30 query-answer pairs and manually compared them against FinanceBench's ground truth. My human accuracy was 27% — only 8 out of 30 were actually correct.

Then I checked the judge against my labels:

Metric	Value
TPR (sensitivity)	1.00
TNR (specificity)	0.55

TPR 1.00 means when an answer is correct, the judge always catches it. Good.

TNR 0.55 means when an answer is wrong, the judge only catches it 55% of the time. Almost half of wrong answers pass as correct.
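Both rates come straight from comparing judge verdicts with human labels. A minimal sketch, with hypothetical names. The toy labels use 8 correct and 22 wrong answers, and the 12 rejected / 10 approved split is inferred: it is the only integer split of 22 that rounds to a TNR of 0.55.

```python
def tpr_tnr(human_labels, judge_labels):
    """Inputs are aligned lists of booleans (True = answer marked correct)."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h and j for h, j in pairs)          # correct, judge approves
    fn = sum(h and not j for h, j in pairs)      # correct, judge rejects
    tn = sum(not h and not j for h, j in pairs)  # wrong, judge rejects
    fp = sum(not h and j for h, j in pairs)      # wrong, judge approves
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr

human = [True] * 8 + [False] * 22
judge = [True] * 8 + [False] * 12 + [True] * 10
tpr, tnr = tpr_tnr(human, judge)
print(round(tpr, 2), round(tnr, 2))
```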

Real example: the judge gave 5/5 to an answer saying "$1,608M" when the ground truth was "$2,018M." The response was well-structured, cited a source, used proper financial language. It just had the wrong number.

This is the core problem: the judge evaluates fluency, not factual accuracy. It can't verify numbers because it doesn't have the ground truth to compare against.

The fix: give the judge the answer key

I added a fourth metric — Answer Correctness (A|GT) — where the judge prompt includes the expected answer from FinanceBench alongside the model's response. Now the judge can actually check if "$1,608M" matches "$2,018M."

After adding A|GT:

Metric	Value
TPR	1.00
TNR	0.86

TNR went from 0.55 to 0.86. The judge now catches 86% of wrong answers.

With this calibrated judge, 53 out of 100 answers were correct. Not 74.

Two judges, two purposes

This isn't about one being better. They measure different things.

A|Q (no ground truth) simulates production. In a live system, you don't have the right answer — that's why the user is asking. This judge tells you if the response is coherent and relevant. Good for monitoring.

A|GT (with ground truth) is for development. When you have labeled data, you use it. This tells you if your pipeline is actually improving or if you're just getting more fluent wrong answers.

The mistake is using only A|Q during development and trusting the numbers. My pipeline looked like 74%. It was 53%.

What didn't work

Automatic metadata filtering via exact match. I tried extracting the company name with the LLM and filtering Qdrant by source filename. Problem: Qdrant's match filter does exact string matching, and "Johnson & Johnson" doesn't match JOHNSON_JOHNSON_2022_10K.pdf. Needs fuzzy or substring matching. Deferred to next phase.
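One possible shape for the deferred fix is to normalize both sides before comparing, then hand the resolved filename to the exact-match filter. This is a sketch under assumptions: the helper names and the second filename are hypothetical, and plain substring matching can over-match short company names.

```python
import re

def normalize(name):
    """Uppercase and collapse every run of non-alphanumerics into one underscore."""
    return re.sub(r"[^A-Z0-9]+", "_", name.upper()).strip("_")

def files_for_company(company, filenames):
    """Return the filenames whose normalized form contains the normalized company."""
    key = normalize(company)
    return [f for f in filenames if key in normalize(f)]

filenames = ["JOHNSON_JOHNSON_2022_10K.pdf", "EXAMPLECO_2022_10K.pdf"]
matches = files_for_company("Johnson & Johnson", filenames)
print(matches)
```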

Framework default judge prompts. Most RAG eval tools ship generic prompts that work for "does this make sense?" but fail for "is this number right?" If your domain requires factual precision, you need custom prompts and you need to calibrate them against human labels. There's no shortcut here.

Where things stand

Metric	Value
Retrieval Recall@6	0.830
Accuracy (calibrated)	53/100
Judge TPR	1.00
Judge TNR	0.86

The pipeline retrieves the right document 83% of the time but only gives the correct answer 53% of the time. The gap comes from retrieval misses (17%) and generation errors on correctly retrieved documents.

Next: systematic error analysis. Categorize every failure, pick the top 2 modes, fix them, measure impact.

Repo: financebench-rag-eval

References

FinanceBench — Patronus AI
6 RAG Evals — Jason Liu
LLM Evals FAQ — Hamel Husain