
Saulo Linares


I rebuilt my Financial Mentor retrieval from scratch. Here's everything the RAG stack taught me

From stuffing JSON into Claude to GraphRAG, hybrid search, CRAG, and adversarial evaluation — the complete honest account

The problem with FinMentor started before I had the vocabulary to describe it...

Users were asking reasonable questions about their portfolios. The system was answering them. Some answers were right. Some answers were wrong. And I couldn't explain the pattern because I hadn't looked at what was actually flowing into the model.

When I looked: every query was receiving the full IBKR portfolio snapshot. JSON format. Five positions, monthly P&L, thirty transactions, account metadata. The same 847 tokens regardless of what was asked. A question about sector concentration got the full transaction history. A question about a single ticker got every other position. Maybe 10% of the context was relevant to any given question. The other 90% was noise competing for attention and billing me for the privilege.

I wasn't doing retrieval. I was doing copy-paste with extra steps.

Act 1 — Three stages, and the bug hiding in one of them
The fix for the naive approach is conceptually simple: index the data, retrieve only what's relevant, inject that into the generation context. Three stages. Each one a decision point.

The index stage is where chunking happens. Fixed-size splitting on token boundaries is fast and mindless. Semantic splitting on logical boundaries — paragraph breaks, section transitions — is slower and produces chunks that hold together as units of meaning. For a portfolio system where one position's data should stay together, hierarchical chunking turned out to matter: a parent chunk covering the full position with child chunks covering individual attributes, each retrievable at the right granularity depending on what the query needs.
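A minimal sketch of the parent/child layout described above, assuming each position arrives as a flat dict. The `Chunk` class, the ID scheme, and the sample field values are illustrative assumptions, not FinMentor's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a parent (whole-position) chunk


def chunk_position(position: dict) -> list:
    """Return one parent chunk for the position plus one child chunk per attribute."""
    ticker = position["ticker"]
    parent_id = f"{ticker}:position"
    # Parent: every attribute of the position kept together as one unit of meaning.
    parent_text = "; ".join(f"{key}={value}" for key, value in position.items())
    chunks = [Chunk(parent_id, parent_text)]
    # Children: one attribute each, linked back to the parent for context expansion.
    for key, value in position.items():
        if key == "ticker":
            continue
        chunks.append(Chunk(f"{ticker}:{key}", f"{ticker} {key}: {value}", parent_id))
    return chunks


# Hypothetical sample position (stable fields only).
chunks = chunk_position(
    {"ticker": "AAPL", "sector": "Technology", "cost_basis": 150.0, "quantity": 40}
)
```

A narrow question can match a child chunk, and the `parent_id` link lets the retriever pull in the whole position when the query needs more context.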

The retrieve stage is where I found the bug I couldn't explain.

FinMentor's original index included real-time price data. Current price, daily P&L, mark-to-market position value. These were indexed alongside stable data — ticker, cost basis, account metadata. Every time the index ran a refresh cycle, the prices updated. But queries were hitting slightly stale snapshots between refreshes. At $875/share for NVDA, a one-day-stale price that is off by $22.50 per share (about 2.6%) turns into a $2,250 error in the reported value of a 100-share position. The system was generating confident specific numbers that were wrong in a way that was completely invisible until you compared them to the actual IBKR feed.

The classification is simple once you see it: real-time financial data should never be in the vector index. It must be fetched fresh at query time and injected directly. Index staleness in a financial context is not a performance bug. It is an accuracy bug. The retrieval stage is where that distinction has to be enforced.
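One way to enforce that split, sketched under assumptions: the field names in `REALTIME_FIELDS` are guesses at what the snapshot calls them, and `fetch_price` is a hypothetical callable standing in for whatever live quote source is used.

```python
from typing import Callable

# Assumed field names; the split itself is the point.
REALTIME_FIELDS = {"current_price", "daily_pnl", "market_value"}


def split_position(position: dict) -> tuple:
    """Separate fields that are safe to index from fields that must be fetched live."""
    stable = {k: v for k, v in position.items() if k not in REALTIME_FIELDS}
    live = {k: v for k, v in position.items() if k in REALTIME_FIELDS}
    return stable, live


def build_context(stable: dict, fetch_price: Callable[[str], float]) -> str:
    """Combine indexed stable data with a price fetched at query time."""
    price = fetch_price(stable["ticker"])  # fetched on every query, never read from the index
    value = price * stable["quantity"]
    return (
        f"{stable['ticker']}: {stable['quantity']} shares, "
        f"cost basis {stable['cost_basis']:.2f}, "
        f"live price {price:.2f}, market value {value:.2f}"
    )


# Hypothetical snapshot row; only `stable` would be sent to the indexer.
stable, live = split_position(
    {"ticker": "NVDA", "quantity": 100, "cost_basis": 600.0,
     "current_price": 852.5, "market_value": 85250.0}
)
context = build_context(stable, fetch_price=lambda ticker: 875.0)
```

The stale `current_price` in the snapshot never reaches the model; the value in `context` is computed from the price fetched at query time.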

Act 2 — The wall I hit when real users arrived
Testing FinMentor's retrieval on my own queries produced numbers I was proud of. Context recall at 0.89, faithfulness at 0.91. I had built the golden dataset myself and I knew what vocabulary was in the index, because I had written the index.

The first real user session broke things immediately.

The user asked about Goldman's view on tech. My test dataset used "Goldman Sachs." The index contained "GS Equity Research," "Goldman analysts," and "GS research team" across different analyst reports — three distinct entity representations in embedding space, each in a different vector neighborhood. A query mentioning "Goldman" retrieved one of them reliably and missed the other two. Context recall on user queries: 0.58.

This is vocabulary mismatch, and it has a predictable structure in financial data. Tickers versus company names. Fed versus Federal Reserve versus FOMC. Formal query language versus casual conversational language. The gap between how you write test queries and how users actually ask questions is wider than intuition suggests.

Three fixes:

Hybrid search merges sparse retrieval (BM25, which matches exact token strings) with dense retrieval (semantic embeddings, which match meaning). Reciprocal Rank Fusion combines the two ranked lists into one. BM25 catches exact tokens: a query for "GS" hits "GS Equity Research" and "GS research team," and a query for "Goldman" hits "Goldman analysts." Semantic search catches the conceptual overlap when the vocabulary doesn't match, which is what connects "Goldman" to the "GS" documents. Together they cover the cases either misses alone.
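Reciprocal Rank Fusion itself is small enough to sketch. Each document scores the sum of 1 / (k + rank) over every list it appears in; k = 60 is a commonly used default, not a value taken from FinMentor. The document IDs and the two input rankings below are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "Goldman".
bm25_ranking = ["goldman_analysts", "gs_equity_research"]
dense_ranking = ["gs_research_team", "goldman_analysts", "gs_equity_research"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

A document that ranks well in both lists outranks one that tops only a single list, which is the behavior that makes the fusion useful here.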

HyDE (Hypothetical Document Embeddings) handles vague queries like "what are people saying about tech." Instead of searching with the query's embedding, generate a hypothetical analyst excerpt that would answer the question, then search with that excerpt's embedding. The hypothetical document uses vocabulary the index actually contains. The embedding lands in a richer neighborhood.
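A toy sketch of the HyDE flow. `toy_embed` is a bag-of-words stand-in for a real embedding model, and `fake_generate` stands in for the LLM call that writes the hypothetical excerpt; both are assumptions made so the example runs without external services.

```python
import math
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query: str, docs: list, generate: Callable[[str], str], top_k: int = 1) -> list:
    hypothetical = generate(query)   # an LLM call in a real system
    key = toy_embed(hypothetical)    # search with the excerpt's embedding, not the query's
    ranked = sorted(docs, key=lambda doc: cosine(key, toy_embed(doc)), reverse=True)
    return ranked[:top_k]


docs = [
    "gs equity research sees semiconductor demand lifting tech earnings",
    "municipal bond yields fell as issuance slowed",
]
query = "what are people saying about tech"


def fake_generate(q: str) -> str:
    # Hypothetical analyst-style excerpt an LLM might write for the query.
    return "analysts expect tech earnings to rise on semiconductor demand"


top = hyde_search(query, docs, fake_generate)
```

In this toy setup the raw query shares one token with the relevant document while the hypothetical excerpt shares four, which is the "richer neighborhood" effect in miniature.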

Query decomposition handles multi-intent questions. "How is my tech concentration relative to sector benchmarks and what does Goldman say about it?" is two retrievals. Decompose it before retrieval, run both, merge before generation.
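The decompose, retrieve, merge flow can be sketched as follows. `decompose` would be an LLM call in practice and `retrieve` the hybrid retriever; here both are hypothetical stubs, and the chunk IDs are invented.

```python
from typing import Callable


def retrieve_decomposed(
    question: str,
    decompose: Callable[[str], list],
    retrieve: Callable[[str], list],
) -> list:
    """Run one retrieval per sub-question and merge results, dropping duplicates."""
    merged, seen = [], set()
    for sub_question in decompose(question):
        for chunk in retrieve(sub_question):
            if chunk not in seen:  # the same chunk can serve both intents
                seen.add(chunk)
                merged.append(chunk)
    return merged


def stub_decompose(question: str) -> list:
    # Hypothetical output of an LLM decomposition step.
    return [
        "How is my tech concentration relative to sector benchmarks?",
        "What does Goldman say about tech concentration?",
    ]


def stub_retrieve(sub_question: str) -> list:
    if "Goldman" in sub_question:
        return ["gs_equity_research_tech", "portfolio_sector_weights"]
    return ["portfolio_sector_weights", "sector_benchmarks"]


merged = retrieve_decomposed("...", stub_decompose, stub_retrieve)
```

Order is preserved so chunks for the first sub-question stay ahead of chunks for the second when the context is assembled.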

The debugging sequence that saved me the most time: work upstream from the output. Bad answer → look at retrieved chunks first. Chunks look fine → look at the index. Index looks fine → look at the routing. Never debug generation when the problem might be retrieval. The layers are distinct. Test them in order.

Act 3 — The eval score that meant nothing
After the retrieval improvements, RAGAS faithfulness hit 0.92. I considered the system production-ready. The first user feedback arrived shortly after: the system had told them something about their portfolio that wasn't true. Confident tone, specific numbers, wrong answer.

I went back through the golden dataset. Every question in it had an answer in the index. I had been measuring how well the system answers questions that can be answered. I had never measured what it does when it can't.

The adversarial category is questions where the answer does not exist in the knowledge base: what did Warren Buffett say about semiconductors this quarter, what is the Federal Reserve's next rate decision, should I buy Tesla stock. The correct behavior is explicit refusal. The incorrect behavior is a confident synthesized answer from training data that sounds grounded and isn't.

Before adding adversarial cases to the eval set, the system's adversarial pass rate was 60%. For every ten out-of-scope questions, four received answers instead of refusals. In a financial context that is not a quality metric. It is a liability metric.
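Measuring that rate needs very little machinery. The sketch below assumes refusals can be detected by a marker phrase, which is crude (a judge model would be more robust), and `stub_answer` is a hypothetical stand-in for the full pipeline.

```python
from typing import Callable

# Assumed refusal phrasing; adjust to whatever the system prompt mandates.
REFUSAL_MARKERS = ("i don't have that information",)


def is_refusal(answer: str) -> bool:
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def adversarial_pass_rate(questions: list, answer_fn: Callable[[str], str]) -> float:
    """Fraction of out-of-scope questions that were refused rather than answered."""
    passed = sum(is_refusal(answer_fn(question)) for question in questions)
    return passed / len(questions)


adversarial_questions = [
    "What did Warren Buffett say about semiconductors this quarter?",
    "What is the Federal Reserve's next rate decision?",
    "Should I buy Tesla stock?",
]


def stub_answer(question: str) -> str:
    # Hypothetical pipeline that hallucinates on one of the three questions.
    if "Tesla" in question:
        return "Tesla looks like a strong buy at current levels."
    return "I don't have that information in your portfolio data."


rate = adversarial_pass_rate(adversarial_questions, stub_answer)
```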

The fix was a combination of hybrid retrieval (higher precision means fewer low-relevance chunks reaching generation), a tighter system prompt (explicit instruction to refuse when context is insufficient), and CRAG — a relevance assessment gate before generation that explicitly routes POOR-relevance retrievals to refusal rather than generation. After all three: 95% adversarial pass rate.

Two additions to the evaluation framework that changed what I could measure:

LLM-as-judge without chain-of-thought exhibits two systematic biases: longer answers score higher (verbosity bias) and early claims in an answer are weighted more heavily than late ones (position bias). Forcing the judge to enumerate specific factual claims and check each against the retrieved context before assigning a score (the G-Eval approach) substantially reduces both. The judge evaluates what it actually says it's evaluating.
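The shape of that claim-by-claim check, as a sketch rather than a G-Eval implementation: `extract_claims` and `is_supported` would both be LLM calls in practice, and the string-based stubs here are assumptions so the example runs on its own.

```python
from typing import Callable


def claim_level_faithfulness(
    answer: str,
    context: str,
    extract_claims: Callable[[str], list],
    is_supported: Callable[[str, str], bool],
) -> tuple:
    """Score = supported claims / total claims, with a per-claim verdict list."""
    claims = extract_claims(answer)
    verdicts = [(claim, is_supported(claim, context)) for claim in claims]
    if not verdicts:
        return 0.0, verdicts  # nothing checkable; treat as unsupported
    score = sum(ok for _, ok in verdicts) / len(verdicts)
    return score, verdicts


def stub_extract_claims(answer: str) -> list:
    # Naive sentence split standing in for an LLM claim extractor.
    return [s.strip().rstrip(".") for s in answer.split(". ") if s.strip()]


def stub_is_supported(claim: str, context: str) -> bool:
    # Substring match standing in for an LLM entailment check.
    return claim.lower() in context.lower()


score, verdicts = claim_level_faithfulness(
    "AAPL is in the Technology sector. AAPL pays a 9% dividend.",
    "AAPL is in the Technology sector with a cost basis of 150.",
    stub_extract_claims,
    stub_is_supported,
)
```

Because the score is a ratio over enumerated claims, padding an answer with extra words does not raise it, and a late unsupported claim costs as much as an early one.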

User query variants generated by Claude — "how's my apple stock doing," "what do I have in aapl," "show me my apple holdings" — cover vocabulary surface area that author-written queries miss entirely. They're not a replacement for real session data. But they're meaningfully better than testing only with queries you write.

The diagnostic table that made the failure modes legible:

| Faithfulness | Context Recall | What it means |
| --- | --- | --- |
| High | Low | Fix retrieval; generation is fine |
| Low | High | Fix generation; retrieval is fine |
| Low | Low | Fix retrieval first; generation compounds it |

Each combination points to a specific layer.
Learn to read the combination before touching anything.
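The same table as a lookup, for wiring into an eval report. The 0.8 cutoff between "high" and "low" is an assumed threshold, and the high/high branch is an addition the table leaves implicit.

```python
def diagnose(faithfulness: float, context_recall: float, threshold: float = 0.8) -> str:
    """Map a (faithfulness, context recall) pair to the layer to fix first."""
    faithful = faithfulness >= threshold
    recalled = context_recall >= threshold
    if faithful and not recalled:
        return "fix retrieval"
    if recalled and not faithful:
        return "fix generation"
    if not faithful and not recalled:
        return "fix retrieval first"
    return "no layer flagged"  # not a row in the table; added for completeness


# The scores from the first real user session: faithfulness 0.91, context recall 0.58.
verdict = diagnose(0.91, 0.58)
```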

Act 4 — The question the vector index couldn't answer
A user asked: "How are AAPL and MSFT related in terms of sector risk?"

Both chunks were in the index. Both retrieved on their own queries fine. AAPL's chunk mentioned Technology sector. MSFT's chunk mentioned Technology sector. The retrieval surfaced both. And then generation produced an answer describing each stock independently, because the relationship between them — shared sector membership, correlated drawdown behavior, combined concentration risk — existed nowhere in the index as text. The connection was implicit. Retrieval needs it explicit.

GraphRAG stores entities as nodes and relationships as edges. After Claude extracts graph elements from the corpus (entities with types, relationships with labels), a NetworkX directed graph carries what the vector index couldn't. The edges AAPL --[belongs_to]--> Technology Sector and MSFT --[belongs_to]--> Technology Sector are now first-class facts. When the user asks how Fed policy affects their tech positions, the system traverses:

Fed Rate Hike → affects → Tech Sector Valuations → compresses → Growth Stock Multiples → AAPL is_a Growth Stock
The answer assembles from the path. No single document contained it.
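A dependency-free sketch of the two lookups involved, using the edges named in this section. A plain adjacency dict stands in for the NetworkX directed graph so the example runs with only the standard library; the helper names are illustrative.

```python
from collections import defaultdict, deque

# (source, label, target) triples taken from the examples in the text.
EDGES = [
    ("AAPL", "belongs_to", "Technology Sector"),
    ("MSFT", "belongs_to", "Technology Sector"),
    ("Fed Rate Hike", "affects", "Tech Sector Valuations"),
    ("Tech Sector Valuations", "compresses", "Growth Stock Multiples"),
    ("AAPL", "is_a", "Growth Stock"),
]


def build_graph(edges: list) -> dict:
    graph = defaultdict(list)
    for source, label, target in edges:
        graph[source].append((label, target))
    return dict(graph)


def shared_targets(graph: dict, a: str, b: str) -> set:
    """Nodes that both a and b point to, e.g. a shared sector."""
    return {t for _, t in graph.get(a, [])} & {t for _, t in graph.get(b, [])}


def find_path(graph: dict, start: str, goal: str):
    """Breadth-first search; returns the list of (source, label, target) hops or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for label, target in graph.get(node, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, hops + [(node, label, target)]))
    return None


graph = build_graph(EDGES)
common = shared_targets(graph, "AAPL", "MSFT")
path = find_path(graph, "Fed Rate Hike", "Growth Stock Multiples")
```

`common` is the explicit fact the vector index lacked for the AAPL/MSFT question, and `path` is the chain of labeled hops that can be handed to generation as context.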

Entity resolution is the practical win that makes this useful at scale. Across fifty analyst reports, Goldman Sachs appears under three distinct strings. A vector index treats them as three entities. A knowledge graph, after a one-time resolution pass at index time, merges all three into one canonical node with all fifty reports attached. Every query variant resolves to the same place. The cost is paid once. The benefit is permanent.
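The resolution pass can be sketched as an alias table applied at index time. The hand-written `ALIASES` map and the report IDs are assumptions for illustration; how the aliases are actually discovered (rules, an LLM pass, or both) is outside this sketch.

```python
from collections import defaultdict

# Assumed alias table mapping lowercase surface strings to one canonical node.
ALIASES = {
    "gs equity research": "Goldman Sachs",
    "goldman analysts": "Goldman Sachs",
    "gs research team": "Goldman Sachs",
}


def resolve(mention: str) -> str:
    """Map a surface string to its canonical entity; unknown mentions pass through."""
    cleaned = mention.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def attach_reports(reports: list) -> dict:
    """Group (report_id, source_mention) pairs under canonical entity nodes."""
    by_entity = defaultdict(list)
    for report_id, mention in reports:
        by_entity[resolve(mention)].append(report_id)
    return dict(by_entity)


# Hypothetical report IDs, one per surface string seen in the corpus.
nodes = attach_reports([
    ("r1", "GS Equity Research"),
    ("r2", "Goldman analysts"),
    ("r3", "GS research team"),
])
```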

The counter-case matters as much as the case: ten thousand independent FAQ articles with no meaningful entity relationships. GraphRAG adds extraction cost per chunk, graph construction latency, entity resolution maintenance, and traversal overhead — with zero retrieval benefit if the graph has no edges worth traversing. The mistake is choosing GraphRAG because it sounds more sophisticated. The question is whether the data has relationships that matter for the queries you need to answer.

CRAG — corrective RAG — is the capability that cuts across all retrieval architectures. Assess the relevance of retrieved chunks before generation. Route POOR-relevance results to explicit refusal rather than hallucinated synthesis. The explicit "I don't have that information" is not a failure. It is the correct answer when the knowledge base doesn't contain what was asked. Building a system that knows when it doesn't know is the hardest single thing on this list.
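A minimal sketch of the gate, assuming an upstream assessor has already attached a relevance score between 0 and 1 to each chunk. The 0.4 cutoff, the refusal wording, and the choice to drop individual POOR chunks while refusing only when none survive are all assumptions, not FinMentor's settings.

```python
from typing import Callable

REFUSAL = "I don't have that information."


def assess(score: float, poor_below: float = 0.4) -> str:
    """Label a relevance score; the threshold is an assumed value."""
    return "POOR" if score < poor_below else "OK"


def gated_answer(
    question: str,
    scored_chunks: list,
    generate: Callable[[str, list], str],
) -> str:
    """Refuse when no retrieved chunk clears the relevance gate; otherwise generate."""
    usable = [text for text, score in scored_chunks if assess(score) != "POOR"]
    if not usable:
        return REFUSAL  # explicit refusal instead of synthesis from weak context
    return generate(question, usable)


def stub_generate(question: str, chunks: list) -> str:
    # Stand-in for the LLM call; echoes what context it was given.
    return f"answer from {len(chunks)} chunk(s): {chunks[0]}"


refused = gated_answer("Should I buy Tesla stock?", [("unrelated note", 0.1)], stub_generate)
answered = gated_answer(
    "What sector is AAPL in?",
    [("AAPL belongs to Technology", 0.9), ("noise", 0.2)],
    stub_generate,
)
```

The generator is never called on the refusal path, so a weak retrieval cannot turn into a confident answer.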

What RAG actually is
Not a technique. A stack. Each layer exists because the layer below it fails in a specific, predictable way.

Vector search fails on vocabulary mismatch → hybrid search adds BM25. Hybrid search fails on vague conceptual queries → HyDE generates a better search key. Retrieval fails silently, passing low-quality context to generation → CRAG adds a relevance gate. Documents fail to represent relationships between entities → GraphRAG makes connections explicit. Evaluation fails because author-written test queries don't represent real user language → real session data and LLM-paraphrased variants extend coverage.

The stack isn't optional. Each layer is load-bearing. A system that has retrieval but no evaluation can't catch the failure modes that only appear in production. A system that has evaluation but no adversarial cases measures the happy path and misses the liability surface. A system with great retrieval on formal vocabulary and zero coverage on casual vocabulary works for the developer and fails the user.

The version of FinMentor that ran on full JSON snapshots was technically a working system. It generated answers. Some answers were right. The version that runs on an indexed, routed, hybrid-retrieved, CRAG-gated, RAGAS-evaluated, adversarially-tested pipeline is a different thing. Not because any individual piece is clever, but because each layer addresses a specific failure mode the layer below left uncovered.

That's what the stack is. Four weeks of finding the specific ways it breaks, and adding a layer each time.
