DEV Community

Saulo Linares

I built a financial AI agent and watched vector search miss the two most relevant positions in the portfolio

I built a financial AI agent to analyze portfolio positions and answer questions about market exposure. The retrieval system worked fine on simple queries. Then I asked it something relational.

"How does Fed policy affect tech positions?"

The system retrieved a P&L summary with a cosine similarity score of 0.237. AAPL came back at 0.031 and MSFT at 0.018, both below the retrieval threshold. The two most relevant positions in the portfolio were filtered out.
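As a rough sketch of how a threshold filter produces that outcome, the snippet below applies a cutoff to the three scores above. The 0.2 cutoff, the chunk ids, and the `filter_by_threshold` helper are assumptions for illustration; the actual threshold is not stated here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    chunk_id: str
    score: float  # cosine similarity between query and chunk embeddings


def filter_by_threshold(hits: list[Hit], threshold: float) -> list[Hit]:
    """Keep only hits at or above the threshold, best first."""
    kept = [h for h in hits if h.score >= threshold]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# Scores reported above; the 0.2 cutoff is assumed, not taken from the real system.
hits = [Hit("pnl_summary", 0.237), Hit("AAPL", 0.031), Hit("MSFT", 0.018)]
retrieved = filter_by_threshold(hits, threshold=0.2)
print([h.chunk_id for h in retrieved])
```

Lowering the cutoff far enough to admit 0.031 would also admit nearly everything else in the index, which is why tuning the threshold alone does not rescue this query.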

I spent time checking chunking strategy, embedding quality, query formulation. All reasonable things to check. But the deeper issue was different: similarity search was solving the wrong problem.

The query depended on a causal chain:

Fed policy → rate hikes → discount rates → growth stock duration → tech valuation sensitivity → portfolio positions

None of those relationships appear as similar text in any document. They exist as connections between entities — not as proximity in embedding space.

That distinction is the whole lesson.

What vector search actually optimizes for

Embedding models compress meaning into vectors. They are very good at finding text that is semantically related to a query. "Fed policy" and "interest rates" will be geometrically close. "Tech valuations" and "growth stocks" will cluster together.

What a similarity score does not capture well is directionality. "Fed policy affects interest rates" and "interest rates affect Fed policy" produce very similar embeddings, so the causal arrow is effectively invisible to nearest-neighbor retrieval.

To be precise: the low scores on AAPL and MSFT likely reflect a combination of chunking quality and query formulation, not a categorical failure of vector search. A better-engineered pipeline would do better. But even a well-tuned vector index has no native concept of "affects" or "belongs_to." That gap is structural, not a tuning problem.
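A deliberately exaggerated illustration of the directionality point: a toy bag-of-words "embedding" that ignores word order scores the two opposite causal claims as identical, while a directed triple keeps them apart. Real embedding models are not bag-of-words and do retain some order information, so treat this as a caricature of the effect, not a measurement. The helper names are hypothetical.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy embedding: word counts, ignoring order entirely."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


forward_text = "fed policy affected interest rates"
reverse_text = "interest rates affected fed policy"
similarity = cosine(bow_vector(forward_text), bow_vector(reverse_text))

# The same two claims as directed (source, relation, target) triples.
forward_edge = ("fed policy", "affects", "interest rates")
reverse_edge = ("interest rates", "affects", "fed policy")

print(f"text similarity: {similarity:.3f}")
print("edges equal:", forward_edge == reverse_edge)
```

The two sentences contain the same words, so the toy similarity is maximal even though the claims point in opposite directions; the triples differ because source and target are separate fields.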

What the knowledge graph found

I added a graph layer. Claude ran entity and relationship extraction on 6 document chunks and produced 36 entities and 39 relationships. Nothing was hand-authored. The extraction prompt asked for entity types (company, sector, metric, event, concept) and relationship types (affects, belongs_to, sensitive_to, reported_by).
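One way to hold the extraction output is a pair of small record types plus a check that every type falls inside the schema the prompt asked for. This is a sketch under assumptions: the entity and relationship type names come from the prompt described above, but the `Entity`, `Relationship`, and `validate` names are hypothetical and say nothing about how the extraction was actually stored.

```python
from dataclasses import dataclass

# Type vocabularies named in the extraction prompt.
ENTITY_TYPES = {"company", "sector", "metric", "event", "concept"}
RELATION_TYPES = {"affects", "belongs_to", "sensitive_to", "reported_by"}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


@dataclass(frozen=True)
class Relationship:
    source: str
    relation: str
    target: str


def validate(entities: list[Entity], relationships: list[Relationship]) -> list[str]:
    """Return human-readable problems; an empty list means the extraction passed."""
    problems = []
    names = {e.name for e in entities}
    for e in entities:
        if e.type not in ENTITY_TYPES:
            problems.append(f"unknown entity type {e.type!r} on {e.name!r}")
    for r in relationships:
        if r.relation not in RELATION_TYPES:
            problems.append(f"unknown relation {r.relation!r}")
        for endpoint in (r.source, r.target):
            if endpoint not in names:
                problems.append(f"relationship references missing entity {endpoint!r}")
    return problems


entities = [Entity("Federal Reserve", "concept"), Entity("Rate hike", "event")]
relationships = [Relationship("Federal Reserve", "affects", "Rate hike")]
print(validate(entities, relationships))
```

Because nothing was hand-authored, a check like this is a cheap first guard against the model inventing a relation type or pointing an edge at an entity it never declared.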

The traversal on the same query:

Federal Reserve → affects → Rate hike → affects → Discount rate → affects → Tech valuations → sensitive_to → AAPL, MSFT

The chain assembled itself from the graph structure. No document contained the sentence "Fed policy affects your tech positions." But the extracted relationships between entities did contain that information — just not as text similarity.
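The traversal above can be sketched as a breadth-first search over directed edges. The edge list below is copied from the chain in this post; the `find_paths` function, the `max_hops` limit, and the choice of breadth-first search are assumptions for illustration, not a description of the actual graph layer.

```python
from collections import defaultdict, deque

# Edges taken from the traversal above, as (source, relation, target) triples.
EDGES = [
    ("Federal Reserve", "affects", "Rate hike"),
    ("Rate hike", "affects", "Discount rate"),
    ("Discount rate", "affects", "Tech valuations"),
    ("Tech valuations", "sensitive_to", "AAPL"),
    ("Tech valuations", "sensitive_to", "MSFT"),
]


def build_adjacency(edges):
    """Map each source node to its outgoing (relation, target) pairs."""
    adjacency = defaultdict(list)
    for source, relation, target in edges:
        adjacency[source].append((relation, target))
    return adjacency


def find_paths(edges, start, targets, max_hops=5):
    """Breadth-first search along edge direction; one shortest path per reachable target."""
    adjacency = build_adjacency(edges)
    queue = deque([(start, [start])])
    visited = {start}
    found = {}
    while queue:
        node, path = queue.popleft()
        if node in targets:
            found[node] = path
            continue
        hops = (len(path) - 1) // 2  # path alternates node, relation, node, ...
        if hops >= max_hops:
            continue
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [relation, neighbor]))
    return found


paths = find_paths(EDGES, "Federal Reserve", targets={"AAPL", "MSFT"})
for target, path in paths.items():
    print(target, ":", " -> ".join(path))
```

Edges are only followed from source to target, so starting the same search at AAPL finds no route back to the Federal Reserve. That asymmetry is the directionality a similarity score lacks.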

Knowledge graph: 36 entities, 39 relationships, traversal path highlighted in teal

This is a proof of concept on 6 chunks, not a production system. Production GraphRAG requires entity disambiguation, ontology validation, and handling extraction errors at scale. The concept is real. The engineering cost is significant.

When graphs are overkill

GraphRAG is not always the right answer. For a corpus of independent FAQ articles with no meaningful entity relationships, graph extraction adds cost, query latency, and maintenance overhead with no retrieval benefit. Standard hybrid RAG — BM25 plus semantic search merged with reciprocal rank fusion — handles that case better.
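For reference, reciprocal rank fusion is small enough to sketch in a few lines: each document's fused score is the sum of 1 / (k + rank) over the ranked lists it appears in. The document ids and the two input rankings are made up for illustration, and k = 60 is the commonly used default, not a value taken from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists; each document scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of a BM25 retriever and a semantic retriever.
bm25_ranking = ["faq_refunds", "faq_shipping", "faq_returns"]
semantic_ranking = ["faq_returns", "faq_refunds", "faq_billing"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, and because only ranks are used, the BM25 and cosine scores never need to be put on a common scale.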

The decision rule: use GraphRAG when relationships between entities matter as much as document content. Use hybrid RAG when they do not.

The refusal that mattered more than the retrieval

After fixing the relational query problem, I tested the opposite case. I asked about a stock that was not in the dataset at all.

This is where most RAG systems quietly fail. Retrieval returns whatever is closest — even if nothing is actually relevant — and the model generates a confident answer from weak context.

I added CRAG: Corrective RAG. The system scores its own retrieval quality before generating. If the maximum relevance score falls below a threshold, it does not generate. It declines instead.

Max retrieval confidence on the out-of-dataset query: 0.10.

The system responded: "I don't have reliable information about this in my knowledge base. Please consult a qualified financial advisor."

CRAG confidence scoring: three scenarios — high confidence answer, partial answer with caveat, low confidence refusal

The refusal behavior itself is not impressive. Any system with a high enough threshold will refuse. What matters is the mechanism: the self-assessment loop runs between retrieval and generation, not after. The system decides whether to generate before it generates.
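A minimal sketch of that ordering, with the gate sitting between retrieval and generation. The three outcomes mirror the scenarios in the figure above; the `low` and `high` cutoffs, the caveat wording, and the `gate` and `respond` names are assumptions, and the cutoffs would need calibrating to whatever relevance scale the retrieval scorer produces.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    ANSWER = "answer"
    ANSWER_WITH_CAVEAT = "answer_with_caveat"
    REFUSE = "refuse"


REFUSAL = (
    "I don't have reliable information about this in my knowledge base. "
    "Please consult a qualified financial advisor."
)


def gate(scores: list[float], low: float = 0.3, high: float = 0.7) -> Action:
    """Choose an action from the best retrieval score, before any generation call."""
    best = max(scores, default=0.0)  # no hits at all counts as zero confidence
    if best < low:
        return Action.REFUSE
    if best < high:
        return Action.ANSWER_WITH_CAVEAT
    return Action.ANSWER


def respond(query: str, scores: list[float], generate: Callable[[str], str]) -> str:
    """Run the gate first; generate() is only called when the gate allows it."""
    action = gate(scores)
    if action is Action.REFUSE:
        return REFUSAL
    answer = generate(query)
    if action is Action.ANSWER_WITH_CAVEAT:
        return answer + " (Retrieved context was only partially relevant.)"
    return answer


# The out-of-dataset query above had a maximum retrieval confidence of 0.10.
print(gate([0.10]))
```

The property worth testing is that the generation function is never invoked on the refusal path, so there is no low-context draft to leak into the response.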

For a financial AI, that ordering is important. A confident hallucination about a portfolio position is a different category of failure than a retrieval miss.

The diagnostic framework I use now

When retrieval fails, the metric combination tells you where to look:

| Faithfulness | Context recall | What it means |
| --- | --- | --- |
| High | Low | Fix retrieval; generation is fine |
| Low | High | Fix generation; retrieval is fine |
| Low | Low | Fix retrieval first |
| High | High | Working correctly |

Faithfulness measures whether the answer came from retrieved context. Context recall measures whether retrieval surfaced the right chunks. A system can score high faithfulness on the wrong retrieved context — which is why both metrics are needed.
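The lookup can be sketched as a small function. Context recall here is a simplified set-overlap version (the share of needed chunks that retrieval returned); evaluation libraries often compute it differently, and faithfulness usually needs an LLM or human judge, so it is taken as an input. The 0.7 cutoff between "high" and "low" and both function names are assumptions.

```python
def context_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    """Fraction of the chunks needed for the answer that retrieval actually surfaced."""
    if not relevant_ids:
        return 1.0  # nothing was needed, so nothing was missed
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)


def diagnose(faithfulness: float, recall: float, cutoff: float = 0.7) -> str:
    """Map a (faithfulness, context recall) pair to where to look first."""
    faithful = faithfulness >= cutoff
    recalled = recall >= cutoff
    if faithful and recalled:
        return "Working correctly"
    if faithful:
        return "Fix retrieval; generation is fine"
    if recalled:
        return "Fix generation; retrieval is fine"
    return "Fix retrieval first"


# The Fed policy query: only the P&L summary came back, AAPL and MSFT were needed.
recall = context_recall({"pnl_summary"}, {"AAPL", "MSFT"})
print(recall, "->", diagnose(faithfulness=0.9, recall=recall))
```

The faithfulness value of 0.9 in the example is invented to show the high-faithfulness, low-recall row: an answer can stay perfectly grounded in the P&L summary and still be the wrong answer.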

One thing I changed after running evals: I stopped writing test queries myself. Author-written queries use vocabulary that matches the index. Real user queries do not. The gap between those two populations is where most retrieval failures live.

What I would do differently

Three things that were not obvious at the start:

Real-time financial data should not be in a vector index. Indexed prices go stale the moment the market moves. Pull fresh from the data source at query time for any price-sensitive question. Use the index only for slow-changing data: analyst reports, historical transactions, reference documents.

Test on queries you did not write. Use an LLM to generate casual paraphrases of your formal test questions. "What is my AAPL position" and "how am I doing with apple stock" should retrieve the same thing. Often they do not.

Adversarial cases are not optional. A golden dataset without questions the system cannot answer will not catch the failure mode that matters most. For a financial AI, incorrect confident answers are a different category of problem than incorrect uncertain answers.
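The last two points can be folded into one golden dataset: paraphrase pairs that should retrieve the same chunk, plus questions the system should decline. The sketch below is a hypothetical harness; `Case`, `run_cases`, the chunk ids, the 0.5 threshold, and the toy retriever are all stand-ins, and a real run would plug in the actual retriever.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Case:
    query: str
    expected_chunk: Optional[str]  # None means the system should refuse


def run_cases(
    cases: list[Case],
    retrieve: Callable[[str], tuple[str, float]],
    threshold: float,
) -> list[str]:
    """Return the queries that failed. retrieve(query) returns (chunk_id, score)."""
    failures = []
    for case in cases:
        chunk_id, score = retrieve(case.query)
        answered = score >= threshold
        if case.expected_chunk is None:
            if answered:  # should have refused but answered
                failures.append(case.query)
        elif not answered or chunk_id != case.expected_chunk:
            failures.append(case.query)
    return failures


def toy_retrieve(query: str) -> tuple[str, float]:
    """Stand-in retriever that only matches the literal ticker, like author-tuned vocabulary."""
    if "aapl" in query.lower():
        return ("aapl_position", 0.9)
    return ("pnl_summary", 0.1)


cases = [
    Case("What is my AAPL position", "aapl_position"),
    Case("how am I doing with apple stock", "aapl_position"),  # casual paraphrase
    Case("What is my position in a stock that is not in the dataset", None),  # adversarial
]
failures = run_cases(cases, toy_retrieve, threshold=0.5)
print(failures)
```

With the toy retriever, the formal query passes and the casual paraphrase fails, which is the gap between author-written and real user queries that a paraphrase set is meant to expose.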
