João Paulo Traguetta Rufino

Posted on Jun 4

From 10% to 57% Accuracy on FinanceBench: What Actually Moved the Needle

#ai #rag #python #architecture

A month ago I started building a RAG system for financial document Q&A. First test: 2 out of 20 questions correct. Last test: 57% accuracy on 100 queries, validated against human labels.

This post is about which improvements actually worked, which didn't, and the one finding that surprised me most.

The setup

The system answers questions about SEC filings (10-K, 10-Q, earnings reports) from 84 public companies and is evaluated against FinanceBench by Patronus AI, which provides 150 expert-annotated Q&A pairs with ground-truth answers.

Final stack: GPT-4o for generation, text-embedding-3-small for embeddings, Qdrant for vector storage (hybrid dense + BM25), LangGraph for orchestration (CRAG pipeline with document grading), BAAI/bge-reranker-base for reranking, and contextual retrieval with metadata prefixes on every chunk.

Full repo: financebench-rag-eval

The progression

Phase	Recall@6	Accuracy (human)	What changed
Baseline	—	10% (20 queries)	First test, vanilla RAG
Phase 2	0.830	~47% (100 queries)	Eval infrastructure built
Phase 3b	0.940	~47%	Corpus fix + metadata filter + hybrid
Phase 4	0.950	~57%	CRAG pipeline + rerank + contextual retrieval + GPT-4o

Two things stand out. Recall@6 went from 0.83 to 0.95, but accuracy stayed at ~47%. Then I changed the generation model and accuracy jumped to ~57%. More on that below.

What actually worked

1. Corpus audit (+10pp recall, zero code change)

I spent two weeks implementing hybrid retrieval, metadata filtering, and query routing. Recall went from 83% to 84%. Then I ran an audit and found that 5 documents were never ingested and 2 were corrupted during PDF extraction.

Fixing that took 30 minutes. Recall jumped to 94%.

9 out of 17 retrieval misses were from Johnson & Johnson documents that simply weren't in the vector store. The pipeline gave no error. It just retrieved chunks from other companies and generated a confident wrong answer.

Lesson: before you optimize retrieval, verify your data is actually all there.
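A minimal sketch of what such an audit could look like, not the repo's actual code. It assumes you can list every stored chunk with a `doc_name` field in its payload (a hypothetical field name) and that you have a list of the documents the benchmark expects. The `min_chunks` threshold for flagging a likely-corrupted extraction is also an assumption.

```python
from collections import Counter


def audit_corpus(expected_docs, stored_chunks, min_chunks=5):
    """Compare the documents we expect against what the vector store holds.

    expected_docs: names of the documents that should be indexed.
    stored_chunks: one dict per stored chunk, each with a "doc_name" key
                   (hypothetical payload shape).
    min_chunks:    documents with fewer chunks than this are flagged as
                   possibly truncated or corrupted during extraction.
    """
    chunk_counts = Counter(chunk["doc_name"] for chunk in stored_chunks)
    expected = set(expected_docs)

    # Documents that never made it into the store at all.
    missing = sorted(expected - set(chunk_counts))

    # Documents that are present but suspiciously small.
    suspicious = sorted(
        name for name in expected if 0 < chunk_counts[name] < min_chunks
    )
    return {"missing": missing, "suspicious": suspicious}
```

Running a check like this before any retrieval tuning turns a silent failure (confident answers built from the wrong company's chunks) into an explicit list of documents to re-ingest.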

2. CRAG pipeline (replaced agent loop)

The original pipeline was a LangGraph agent that decided when to retrieve and when to answer. Sometimes it made 5-6 retrieval calls, pulling in noise from unrelated companies.

I replaced it with an explicit graph: query_analysis → retrieve → rerank → grade_documents → generate. If the grading step says the chunks are irrelevant, it relaxes the metadata filter and retries once.

This made the pipeline predictable, cheaper (fewer API calls), and easier to debug. Every step has a fixed role instead of the LLM deciding the flow.
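The fixed flow can be sketched in plain Python, without any framework. This is an illustration of the control flow only: every step is an injected callable with a hypothetical signature, and "relaxing the filter" is modeled here as dropping it entirely, which is an assumption about what the real pipeline does.

```python
def run_pipeline(question, analyze, retrieve, rerank, grade, generate):
    """Fixed-order flow: analysis -> retrieve -> rerank -> grade -> generate.

    All callables are hypothetical stand-ins:
      analyze(question)            -> metadata filter (e.g. company, year)
      retrieve(question, filters)  -> list of chunks
      rerank(question, chunks)     -> list of chunks, best first
      grade(question, chunks)      -> True if the chunks look relevant
      generate(question, chunks)   -> final answer
    """
    filters = analyze(question)
    chunks = []

    # First attempt uses the strict filter; the second (and last) attempt
    # drops it. There is no open-ended loop, so cost stays bounded.
    for attempt_filters in (filters, None):
        chunks = rerank(question, retrieve(question, attempt_filters))
        if grade(question, chunks):
            break

    # Generate from whatever the last attempt produced.
    return generate(question, chunks)
```

Because the number of retrieval calls is capped at two, the worst-case cost per query is known up front, which is the property the agent loop lacked.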

3. Contextual retrieval prefixes

SEC filings use nearly identical language across companies. "Net revenues increased" appears in every 10-K. So I prepended metadata to each chunk before embedding:

Company: Johnson & Johnson | Document: 10K | Year: 2022

This changes the embedding to capture where the chunk comes from, not just what it says. Combined with metadata filtering at query time, it reduced cross-company retrieval errors.
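As a small sketch, building that prefix could look like the following. The function name and the blank line separating prefix from chunk text are assumptions; only the prefix format itself comes from the example above.

```python
def add_context_prefix(chunk_text, company, doc_type, year):
    """Return the text to embed: a metadata prefix followed by the chunk."""
    prefix = f"Company: {company} | Document: {doc_type} | Year: {year}"
    # The separator between prefix and chunk is an arbitrary choice here.
    return f"{prefix}\n\n{chunk_text}"


text_to_embed = add_context_prefix(
    "Net revenues increased compared to the prior year.",
    company="Johnson & Johnson",
    doc_type="10K",
    year=2022,
)
```

The prefixed string is what gets embedded, so two otherwise identical sentences from different companies no longer map to nearly the same vector.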

4. Switching from GPT-4o-mini to GPT-4o (+10pp accuracy)

This was the biggest finding of the project.

After all the retrieval improvements, accuracy was stuck at ~47%. Recall was at 95%. The pipeline was retrieving the right documents but the model was extracting wrong numbers or saying "I don't know" when the answer was right there in the context.

I switched generation from GPT-4o-mini to GPT-4o. Accuracy went from ~47% to ~57%. Same retrieval, same chunks, same prompts. Just a better model.

The bottleneck was never the retrieval. It was the generation model's ability to reason about financial data.

What didn't work

Hybrid retrieval (dense + BM25). Added BM25 via FastEmbedSparse with RRF fusion. Faithfulness improved (+0.78) because BM25 catches exact number matches, but Precision and MRR dropped. BM25 pulled in keyword-matching chunks that weren't semantically relevant, pushing correct chunks lower in the ranking.
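To see how fusion can demote a correct chunk, here is a toy reciprocal rank fusion in plain Python. It is a sketch of the general technique, not of what Qdrant does internally; the constant k=60 is a commonly used default and the chunk ids are made up.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Dense retrieval ranks the correct chunk first.
dense = ["correct", "kw_a", "kw_b"]
# BM25 misses it and favors two keyword-matching chunks instead.
sparse = ["kw_a", "kw_b", "other"]

fused = rrf_fuse([dense, sparse])
```

In this toy case the two keyword-matching chunks appear in both lists, so their summed scores beat the correct chunk, which only the dense ranking found. It drops from rank 1 to rank 3, which is the kind of effect that lowers Precision and MRR.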

Judge v1 without calibration. My LLM judge said 63 out of 100 answers were correct. When I checked against 30 human labels, the real figure was closer to 47%. The judge overstated accuracy by 16 percentage points (63 / 47 ≈ 1.34, a 34% relative inflation) because it rewarded fluency, not numerical accuracy. An answer saying "$1,608M" when the correct answer was "$2,018M" got 5/5 because it was well structured.

I built a stricter judge (v2) with explicit numerical comparison rules. TNR improved from 0.75 to 0.94.
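One way to make a numerical rule explicit is a deterministic check that runs alongside the LLM judge. The sketch below is an illustration, not the v2 judge itself: the 1% relative tolerance is an assumption, and it ignores units and scale words ("M" vs "million" vs "billion"), which a real check would need to normalize.

```python
import re

# Matches integers and decimals with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Pull every number out of a string, dropping thousands separators."""
    return [float(match.replace(",", "")) for match in _NUMBER.findall(text)]


def numbers_match(answer, truth, rel_tol=0.01):
    """True if every number in the ground truth appears in the answer,
    within a relative tolerance.

    Note: returns True vacuously when the ground truth has no numbers,
    so this only makes sense for numeric questions.
    """
    expected = extract_numbers(truth)
    found = extract_numbers(answer)
    return all(
        any(abs(f - e) <= rel_tol * abs(e) for f in found)
        for e in expected
    )
```

With a check like this, `numbers_match("$1,608M", "$2,018M")` is `False` no matter how well the answer is written.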

The eval system

Every number in this post comes from a multi-tier eval:

Tier 1 (retrieval): Recall@6, Precision@6, MRR. Measured separately from generation so I could tell where the pipeline was failing.

Tier 2 (generation): LLM-as-judge scoring context relevance, faithfulness, and answer correctness against ground truth. Two judge versions: v1 (lenient, fluency-biased) and v2 (strict, numerical tolerance enforced).

Calibration: Every judge validated against 30 human labels. TPR and TNR reported. Final calibration: TPR=0.82, TNR=0.92. Without this step, I would have reported 63% accuracy instead of the real 47%.
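For reference, the Tier 1 metrics and the calibration rates can be computed as below. These are the standard textbook definitions, written as a sketch; whether the repo computes them in exactly this way (for example, how Recall@6 treats questions with several relevant chunks) is an assumption.

```python
def recall_at_k(retrieved, relevant, k=6):
    """Fraction of the relevant chunk ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k=6):
    """Fraction of the top k slots filled by a relevant chunk id."""
    relevant = set(relevant)
    return sum(doc_id in relevant for doc_id in retrieved[:k]) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def judge_rates(human_labels, judge_labels):
    """TPR and TNR of the judge, treating human labels (bools) as truth."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h and j for h, j in pairs)          # both say correct
    fn = sum(h and not j for h, j in pairs)      # judge rejects a correct answer
    tn = sum(not h and not j for h, j in pairs)  # both say wrong
    fp = sum(not h and j for h, j in pairs)      # judge accepts a wrong answer
    tpr = tp / (tp + fn) if tp + fn else None
    tnr = tn / (tn + fp) if tn + fp else None
    return tpr, tnr
```

MRR is the mean of `reciprocal_rank` over all queries. TNR is the number to watch for a fluency-biased judge, since it measures how often wrong answers are actually caught.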

Cost

Metric	Value
Cost per query	$0.017
Average latency	40.7s
Tokens per query	~6,900
Total eval cost (100 queries)	~$1.74

What I'd do differently

Start with a corpus audit before any algorithmic work. I could have saved two weeks.

Build the eval infrastructure in week 1, not week 4. Without measurement, I was guessing. With measurement, every change had a clear before/after.

Test the generation model earlier. I assumed GPT-4o-mini was "good enough" and spent weeks optimizing retrieval. The model swap should have been the first experiment, not the last.

What's next

The 57% accuracy is competitive for RAG on FinanceBench (GPT-4 with full document context scores ~60-65% on this benchmark). But there's room to improve: better table extraction from PDFs, larger chunk sizes to preserve financial tables, and multi-step reasoning for complex calculations.

These are documented as future work in the repo.

Repo: financebench-rag-eval

References

FinanceBench — Patronus AI
6 RAG Evals — Jason Liu
LLM Evals FAQ — Hamel Husain
AI Builder's Handbook — LevelUp Labs
LangGraph docs
