The Day Our Financial Chatbot Almost Cost a Client 100,000
Six months into production, our financial RAG chatbot faced its first real crisis. A hedge fund client asked: "What's our exposure to tech sector derivatives as of last Friday's close?"
The bot responded instantly—with data from two weeks ago.
No error. No warning. Just confidently wrong information that could have triggered a bad trading decision.
We caught it during an internal audit before the client acted on it. But that moment changed everything about how we think about LLM evaluation.
This is the story of building, scaling, and continuously improving a financial RAG system that now handles 10,000+ monthly queries with 92% resolution rate and sub-second responses. And more importantly, how we evaluate it to ensure it never makes that mistake again.
The Architecture: Multi-Source Financial RAG
Financial queries are uniquely challenging. They require:
- Real-time accuracy (stock prices, market movements)
- Historical context (past performance, trends)
- Regulatory knowledge (compliance requirements)
- Document understanding (earnings reports, SEC filings)
Our LangChain-based architecture integrates all four:
```text
User Query
    ↓
[Query Understanding Layer]
    → Intent classification (pricing, analysis, compliance, reporting)
    → Entity extraction (tickers, dates, document types)
    ↓
[Multi-Source Retrieval]
    → Real-time market data API (current prices, volumes)
    → Vector store (historical reports, earnings calls)
    → Structured database (client positions, exposure limits)
    → Regulatory corpus (SEC rules, compliance guidelines)
    ↓
[Context Assembly & Ranking]
    ↓
[Generation with Source Attribution]
    ↓
Response with Citations
```
The magic—and the risk—is in how these sources are weighted and combined. A query about "Apple's revenue growth" needs both current stock data and historical financial statements. A compliance question needs regulatory documents first, market data second.
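That weighting can be sketched as a per-intent lookup that rescales each chunk's retrieval score before ranking. The intent names, source keys, and weight values below are illustrative stand-ins, not our production configuration:

```python
# Illustrative per-intent source weights; real values are tuned per deployment.
SOURCE_WEIGHTS = {
    "pricing":    {"market_api": 0.6, "vector_store": 0.2, "structured_db": 0.2, "regulatory": 0.0},
    "compliance": {"regulatory": 0.6, "vector_store": 0.2, "structured_db": 0.1, "market_api": 0.1},
}
DEFAULT_WEIGHTS = {"market_api": 0.25, "vector_store": 0.25, "structured_db": 0.25, "regulatory": 0.25}

def rank_chunks(intent: str, chunks: list[dict]) -> list[dict]:
    """Rescale each chunk's retrieval score by its source's weight for this intent."""
    weights = SOURCE_WEIGHTS.get(intent, DEFAULT_WEIGHTS)
    return sorted(
        chunks,
        key=lambda c: weights.get(c["source"], 0.0) * c["score"],
        reverse=True,
    )
```

With this shape, a compliance query surfaces a moderately scored regulatory chunk above a high-scoring market-data chunk, which is exactly the behavior the prose describes.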
Scaling to 10,000+ Monthly Queries
Going from prototype to production at scale required solving three distinct challenges:
Challenge 1: Latency at Scale
Financial users expect instant responses. Sub-second isn't a nice-to-have—it's table stakes.
What we optimized:
- Parallel retrieval: All retrieval sources queried simultaneously
- Chunking strategy: Variable chunk sizes based on document type (small for news, larger for reports)
- Caching layer: Frequent queries (like "current price of $SPY") served from cache with TTL validation
- Streaming responses: First token under 200ms, full response under 1 second
Result: 99.9% of queries under 800ms at peak load.
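The parallel-retrieval idea reduces to fanning out all source lookups with `asyncio.gather`; a minimal sketch, where the fetch functions are stand-ins simulating independent I/O latency rather than our real source clients:

```python
import asyncio

# Stand-ins for the real source clients; each simulates independent I/O latency.
async def fetch_market_data(query: str) -> dict:
    await asyncio.sleep(0.05)
    return {"source": "market_api", "query": query}

async def fetch_documents(query: str) -> dict:
    await asyncio.sleep(0.05)
    return {"source": "vector_store", "query": query}

async def fetch_positions(query: str) -> dict:
    await asyncio.sleep(0.05)
    return {"source": "structured_db", "query": query}

async def fetch_regulatory(query: str) -> dict:
    await asyncio.sleep(0.05)
    return {"source": "regulatory", "query": query}

async def retrieve_all(query: str) -> list[dict]:
    # gather runs all four lookups concurrently, so wall time is roughly
    # the slowest single source, not the sum of all four.
    return await asyncio.gather(
        fetch_market_data(query),
        fetch_documents(query),
        fetch_positions(query),
        fetch_regulatory(query),
    )

results = asyncio.run(retrieve_all("AAPL revenue growth"))
```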
Challenge 2: Resolution Rate Engineering
Hitting 92% resolution rate wasn't accidental—it was engineered through systematic iteration.
The funnel approach:
```text
Total Queries (100%)
    ↓
[Intent Recognition Failure] → 3% (escalate to human)
    ↓
[Retrieval Failure] → 2% (fallback to broader search)
    ↓
[Generation Failure] → 3% (rephrase, retry once)
    ↓
[Success] → 92% (resolved autonomously)
```
Each failure category had specific remediations:
- Intent failures: Expanded training data for edge cases
- Retrieval failures: Added hybrid search (keyword + semantic) for better recall
- Generation failures: Implemented self-verification step before returning
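The funnel and its remediations can be sketched as a pipeline where each stage either succeeds or falls through to one fallback before human escalation. The stage functions are injected here so the control flow stays visible; this is a sketch of the shape, not our production handler:

```python
def handle_query(query, classify, retrieve, broad_search, generate, verify, rephrase, escalate):
    """Resolution funnel: each stage gets one fallback before human escalation."""
    intent = classify(query)
    if intent is None:                                # intent failure -> human
        return escalate(query)

    docs = retrieve(query, intent)
    if not docs:
        docs = broad_search(query)                    # retrieval fallback: broader search
        if not docs:
            return escalate(query)

    answer = generate(query, docs)
    if not verify(answer, docs):                      # self-verification before returning
        answer = generate(rephrase(query), docs)      # rephrase, retry once
        if not verify(answer, docs):
            return escalate(query)
    return answer
```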
Challenge 3: Production Reliability
Financial systems can't go down. Period.
Our stack:
- Load balancing: Multiple model endpoints with automatic failover
- Rate limiting: Per-client quotas to prevent abuse
- Graceful degradation: Fallback to simpler models when primary is overloaded
- Automated recovery: Self-healing on common error patterns
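Graceful degradation with failover can be as simple as an ordered list of endpoints tried until one answers; a minimal sketch (endpoint names and the caller-supplied functions are hypothetical):

```python
def call_with_fallback(query: str, endpoints: list) -> str:
    """Try each (name, call_fn) pair in priority order; raise only if all fail."""
    errors = []
    for name, call in endpoints:
        try:
            return call(query)
        except Exception as exc:          # overloaded / timed out -> try next tier
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all endpoints failed: {errors}")
```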
The Monitoring Stack: LangSmith as Our Canary
This is where evaluation becomes inseparable from operations. LangSmith isn't just logging—it's our early warning system.
Performance Tracking: The Real-Time Dashboard
Every query generates a trace with critical metrics:
```python
# Instrumentation example
import time

from langsmith import traceable, get_current_run_tree

@traceable(run_type="chain", name="financial_rag")
def process_query(query: str, user_context: dict):
    start = time.time()

    # Track retrieval stages (parallelized in production)
    market_data = retrieve_market_data(query)
    documents = retrieve_documents(query)
    positions = retrieve_positions(user_context)

    # Log latency per source on the current trace
    run = get_current_run_tree()
    run.add_metadata({
        "market_data_latency_ms": market_data.latency,
        "documents_latency_ms": documents.latency,
        "total_retrieval_ms": (time.time() - start) * 1000,
    })

    # Track source contributions
    run.add_metadata({
        "sources_used": ["market_api", "vector_store", "structured_db"],
        "chunks_retrieved": len(documents.chunks),
    })

    return generate_response(market_data, documents, positions)
```
What we monitor in real-time:
- p95/p99 latency per query type
- Retrieval success rate (did we find relevant documents?)
- Source distribution (which knowledge sources are being used?)
- Token usage per query (cost tracking)
- Error rate by category
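For reference, p95/p99 over a window of latency samples reduces to a sorted-index lookup; a sketch using the nearest-rank method (the sample values are made up for illustration):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    if not values:
        raise ValueError("empty window")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

latencies_ms = [120, 340, 210, 95, 780, 150, 400, 610, 230, 180]
p95 = percentile(latencies_ms, 95)  # the slowest-but-one tail of this window
```

Production systems usually compute this over streaming sketches (t-digest, HDRHistogram) rather than raw sorted windows, but the definition is the same.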
Error Analysis: Finding the Needles
When something fails, we need to know why—immediately.
Error categorization pipeline:
- Runtime exceptions → Alert on-call engineer
- Empty retrieval → Flag for retrieval tuning
- Low confidence generation → Log for offline analysis
- User feedback (thumbs down) → Prioritize for review
The weekly review ritual:
Every Monday, we sample 50 failed queries and categorize them:
- Retrieval missed relevant docs (30%)
- Model misinterpreted intent (25%)
- Missing data in sources (20%)
- Model hallucinated (15%)
- Other (10%)
This drives our improvement roadmap.
Query Categorization: Understanding Usage Patterns
We tag every query with multiple dimensions:
```json
{
  "query": "What's our exposure to tech sector ETFs?",
  "intent": "risk_exposure",
  "asset_class": "equity",
  "time_sensitivity": "current",
  "complexity": "medium",
  "user_role": "portfolio_manager",
  "source_preference": "positions_first"
}
```
What this enables:
- Usage analytics: Which user roles ask which questions?
- Performance segmentation: Is latency higher for complex queries?
- Retrieval optimization: Different query types need different source weighting
- Training data generation: Real queries become evaluation examples
Continuous Model Improvement: The Feedback Loop
LangSmith's trace data becomes our training data. Here's the loop:
Step 1: Identify improvement opportunities
- Queries with low confidence scores
- Queries where users gave negative feedback
- Queries that required human escalation
Step 2: Create evaluation datasets
```python
# From production traces to test cases
evaluation_dataset = [
    {
        "query": failed_query,
        "expected_sources": ["sec_filings", "earnings_transcripts"],
        "expected_entities": ["AAPL", "Q3 2024"],
        "ideal_response_summary": "Revenue growth with segment breakdown",
    }
    for failed_query in last_week_failures
]
```
Step 3: Run offline evaluations
- Test prompt variations
- Test retrieval parameter changes
- Test different chunking strategies
Step 4: A/B test in production
- 5% traffic to new configuration
- Compare metrics side-by-side
- Roll out if improvements hold
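Deterministic traffic splitting keeps a user in the same arm across queries; a common sketch hashes the user ID together with the experiment name (function and parameter names here are illustrative):

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, treatment_pct: float = 5.0) -> str:
    """Stable bucket assignment: the same user + experiment always maps to the same arm."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # First 8 hex chars -> uniform value in [0, 100)
    slot = int(digest[:8], 16) / 0x100000000 * 100
    return "treatment" if slot < treatment_pct else "control"
```

Hashing on `experiment:user_id` (rather than user ID alone) re-randomizes assignments between experiments, so one cohort doesn't accumulate every treatment.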
The Evaluation Framework That Keeps Us Honest
With great power comes great responsibility—especially in finance. Our evaluation framework has four layers:
Layer 1: Unit Tests for Deterministic Components
```python
def test_date_extraction():
    query = "What was our P&L on March 15, 2024?"
    entities = extract_entities(query)
    assert entities["date"] == "2024-03-15"

def test_ticker_recognition():
    query = "How's BRK.B performing?"
    tickers = extract_tickers(query)
    assert "BRK.B" in tickers
```
Layer 2: Retrieval Quality Metrics
```python
def evaluate_retrieval(test_case):
    retrieved = retrieve_documents(test_case.query)

    # Deterministic set-based precision/recall against labeled expected docs.
    # (For fuzzier cases, ragas offers LLM-judged context_precision /
    # context_recall metrics, run through ragas.evaluate over a dataset.)
    relevant = set(test_case.expected_docs)
    hits = [doc for doc in retrieved if doc in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0

    assert precision > 0.8, f"Precision {precision:.2f} below threshold"
    assert recall > 0.7, f"Recall {recall:.2f} below threshold"
```
Layer 3: Generation Quality (LLM-as-Judge)
```python
FINANCIAL_EVALUATION_PROMPT = """
You are evaluating a financial assistant's response. Score each dimension 1-5:

Accuracy (1-5): Are all numerical claims correct? Are dates and entities right?
Completeness (1-5): Does it address all parts of the query?
Source Attribution (1-5): Are claims traceable to provided sources?
Risk Awareness (1-5): Does it appropriately qualify uncertain information?
Conciseness (1-5): Is it clear without unnecessary detail?

Query: {query}
Response: {response}
Sources: {sources}

Return JSON with scores and brief justification.
"""
```
Layer 4: Safety and Compliance Checks
Financial responses have non-negotiable requirements:
```python
def safety_check(response):
    checks = {
        "forward_looking_statements": not contains_speculative_future(response),
        "regulated_terms": not uses_restricted_phrases(response),
        "disclaimers_present": has_required_disclaimers(response),
        "hallucination_check": all_claims_sourced(response),
    }
    return all(checks.values())
```
The 92% Resolution Rate: What It Actually Means
Let's be precise about what "92% resolution rate" means in practice.
The full breakdown:
- Full resolution (78%): Query answered completely, no follow-up needed
- Partial resolution (14%): Answer provided, but the user needed to clarify or ask a follow-up
- Escalation (5%): Handed to a human after a bot attempt
- Failed (3%): Bot couldn't handle it; routed directly to a human

The 92% covers the first two categories: queries the bot resolved without human help.
What drives improvements:
- Each 1% gain in resolution required ~200 new evaluation cases
- The hardest gains come from edge cases (unusual ticker formats, complex multi-part queries)
- We track "resolution by category" to focus efforts
Lessons Learned: What We'd Do Differently
What Worked
Multi-source from day one
Building with multiple knowledge sources forced us to think about source selection and weighting early. Retrofitting would have been painful.
LangSmith instrumentation before launch
We had tracing from the first prototype. This meant when we launched, we immediately had baseline data.
User feedback as first-class metric
Thumbs up/down isn't just a nice-to-have—it's our most valuable signal. We treat every downvote as a bug report.
Sub-second obsession
Financial users won't wait. Optimizing for latency forced better architecture decisions.
What We'd Change
More evaluation data earlier
We started with 50 test cases. We needed 500. Build your evaluation dataset before you think you need it.
Stricter hallucination detection
Our initial monitoring missed the stale data incident. Now we check every numerical claim against source timestamps.
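The timestamp check that would have caught the opening incident is cheap: compare each claim's source timestamp against the query's as-of time. A minimal sketch (the field names and staleness budget are illustrative):

```python
from datetime import datetime, timedelta, timezone

def is_fresh(source_ts: datetime, as_of: datetime, max_staleness: timedelta) -> bool:
    """A numerical claim is servable only if its source is no older than max_staleness."""
    return as_of - source_ts <= max_staleness

# "As of last Friday's close" vs. a snapshot from two weeks earlier
as_of = datetime(2024, 6, 14, 21, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 6, 14, 20, 30, tzinfo=timezone.utc)
stale = datetime(2024, 5, 31, 20, 30, tzinfo=timezone.utc)
```

A failed check downgrades the response: either re-retrieve from the live source or answer with an explicit "data as of <timestamp>" qualifier instead of serving the number silently.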
Earlier A/B testing infrastructure
We waited too long to implement A/B tests. Now we test every significant change against 5-10% of traffic.
Regulatory review integration
Compliance should be in the loop from day one, not after incidents.
The Road Ahead
Our evaluation framework evolves continuously. Current priorities:
Real-time hallucination detection
Using smaller models to verify each claim against sources before returning to user.
Multi-lingual expansion
Evaluating performance across languages without losing financial accuracy.
Reasoning transparency
Helping users understand why the bot answered the way it did, with visible chain-of-thought.
Automated test generation
Using production traces to automatically create new evaluation cases.
The Bottom Line
Building a production financial RAG system isn't about having the biggest model or the cleverest prompts. It's about:
- Measuring everything (latency, accuracy, source usage, failure modes)
- Learning systematically (every failure becomes an evaluation case)
- Improving continuously (A/B tests, not guesswork)
- Staying honest (knowing what you don't know)
Our 92% resolution rate isn't a finish line—it's a baseline. Every week we find new edge cases, new failure modes, new opportunities to improve.
And that's the real lesson: LLM evaluation isn't a one-time activity. It's the discipline of getting better every day.
Building a financial RAG system? I'd love to hear about your evaluation challenges. What metrics matter most to you? What's broken in surprising ways? Let me know.