
ruchika bhat
Beyond Customer Support: Building Production-Grade Financial RAG Systems

The Day Our Financial Chatbot Almost Cost a Client 1,00,000

Six months into production, our financial RAG chatbot faced its first real crisis. A hedge fund client asked: "What's our exposure to tech sector derivatives as of last Friday's close?"

The bot responded instantly—with data from two weeks ago.

No error. No warning. Just confidently wrong information that could have triggered a bad trading decision.

We caught it during an internal audit before the client acted on it. But that moment changed everything about how we think about LLM evaluation.

This is the story of building, scaling, and continuously improving a financial RAG system that now handles 10,000+ monthly queries with 92% resolution rate and sub-second responses. And more importantly, how we evaluate it to ensure it never makes that mistake again.


The Architecture: Multi-Source Financial RAG

Financial queries are uniquely challenging. They require:

  • Real-time accuracy (stock prices, market movements)
  • Historical context (past performance, trends)
  • Regulatory knowledge (compliance requirements)
  • Document understanding (earnings reports, SEC filings)

Our LangChain-based architecture integrates all four:

User Query
    ↓
[Query Understanding Layer]
    → Intent classification (pricing, analysis, compliance, reporting)
    → Entity extraction (tickers, dates, document types)
    ↓
[Multi-Source Retrieval]
    → Real-time market data API (current prices, volumes)
    → Vector store (historical reports, earnings calls)
    → Structured database (client positions, exposure limits)
    → Regulatory corpus (SEC rules, compliance guidelines)
    ↓
[Context Assembly & Ranking]
    ↓
[Generation with Source Attribution]
    ↓
Response with Citations

The magic—and the risk—is in how these sources are weighted and combined. A query about "Apple's revenue growth" needs both current stock data and historical financial statements. A compliance question needs regulatory documents first, market data second.


Scaling to 10,000+ Monthly Queries

Going from prototype to production at scale required solving three distinct challenges:

Challenge 1: Latency at Scale

Financial users expect instant responses. Sub-second isn't a nice-to-have—it's table stakes.

What we optimized:

  • Parallel retrieval: All four sources queried simultaneously
  • Chunking strategy: Variable chunk sizes based on document type (small for news, larger for reports)
  • Caching layer: Frequent queries (like "current price of $SPY") served from cache with TTL validation
  • Streaming responses: First token under 200ms, full response under 1 second

Result: 99.9% of queries under 800ms at peak load.
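The parallel-retrieval and TTL-cache ideas above can be sketched as follows. This is a minimal illustration, not our production code: the three retriever coroutines are stand-ins for the real API, vector store, and database clients, and the cache is a plain in-process dict.

```python
import asyncio
import time

# Hypothetical in-memory cache with TTL validation (illustrative only)
_cache: dict = {}
CACHE_TTL_SECONDS = 5  # short TTL, since market data goes stale fast

def cache_get(key: str):
    entry = _cache.get(key)
    if entry and time.time() - entry["ts"] < CACHE_TTL_SECONDS:
        return entry["value"]
    return None

def cache_put(key: str, value) -> None:
    _cache[key] = {"value": value, "ts": time.time()}

# Stand-ins for the real retrievers; each would call an external service
async def retrieve_market_data(query: str) -> dict:
    return {"source": "market_api", "query": query}

async def retrieve_documents(query: str) -> dict:
    return {"source": "vector_store", "query": query}

async def retrieve_positions(query: str) -> dict:
    return {"source": "structured_db", "query": query}

async def parallel_retrieve(query: str) -> list:
    cached = cache_get(query)
    if cached is not None:
        return cached
    # All sources queried simultaneously instead of sequentially
    results = await asyncio.gather(
        retrieve_market_data(query),
        retrieve_documents(query),
        retrieve_positions(query),
    )
    cache_put(query, list(results))
    return list(results)

results = asyncio.run(parallel_retrieve("current price of $SPY"))
```

With sequential calls, total retrieval latency is the sum of the source latencies; with `asyncio.gather` it is roughly the maximum, which is where most of the latency win comes from.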

Challenge 2: Resolution Rate Engineering

Hitting 92% resolution rate wasn't accidental—it was engineered through systematic iteration.

The funnel approach:

Total Queries (100%)
    ↓
[Intent Recognition Failure] → 3% (escalate to human)
    ↓
[Retrieval Failure] → 2% (fallback to broader search)
    ↓
[Generation Failure] → 3% (rephrase, retry once)
    ↓
[Success] → 92% (resolved autonomously)

Each failure category had specific remediations:

  • Intent failures: Expanded training data for edge cases
  • Retrieval failures: Added hybrid search (keyword + semantic) for better recall
  • Generation failures: Implemented self-verification step before returning
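One common way to combine keyword and semantic results, sketched below, is reciprocal rank fusion (RRF); the document IDs and ranked lists are illustrative, and our production retriever may weight things differently.

```python
# Reciprocal rank fusion: merge two ranked lists by summing 1/(k + rank)
# for each document. k=60 is the conventional damping constant.
def reciprocal_rank_fusion(keyword_ranked, semantic_ranked, k=60):
    scores: dict = {}
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["10-K_2024", "earnings_q3", "news_123"]
semantic_hits = ["earnings_q3", "transcript_q3", "10-K_2024"]
fused = reciprocal_rank_fusion(keyword_hits, semantic_hits)
```

Documents that appear high in both lists float to the top, which is exactly the recall boost hybrid search gave us over semantic-only retrieval.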

Challenge 3: Production Reliability

Financial systems can't go down. Period.

Our stack:

  • Load balancing: Multiple model endpoints with automatic failover
  • Rate limiting: Per-client quotas to prevent abuse
  • Graceful degradation: Fallback to simpler models when primary is overloaded
  • Automated recovery: Self-healing on common error patterns
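The graceful-degradation pattern can be sketched as a failover wrapper that walks endpoints in priority order. The endpoint names and error handling here are illustrative; production code would catch narrower exception types and emit metrics per failure.

```python
# Try each endpoint in priority order; return the first success.
def call_with_failover(query, endpoints):
    errors = []
    for name, fn in endpoints:
        try:
            return name, fn(query)
        except Exception as exc:  # production: catch specific error types
            errors.append((name, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")

# Hypothetical endpoints: the primary times out, the fallback answers
def primary(query):
    raise TimeoutError("primary overloaded")

def fallback_small_model(query):
    return f"answer({query})"

used, answer = call_with_failover("AAPL price?", [
    ("primary", primary),
    ("fallback", fallback_small_model),
])
```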

The Monitoring Stack: LangSmith as Our Canary

This is where evaluation becomes inseparable from operations. LangSmith isn't just logging—it's our early warning system.

Performance Tracking: The Real-Time Dashboard

Every query generates a trace with critical metrics:

# Instrumentation example (sketch; assumes the retriever and
# generate_response helpers exist elsewhere)
import time

from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree

@traceable(run_type="chain", name="financial_rag")
def process_query(query: str, user_context: dict):
    # Start timer
    start = time.time()

    # Track retrieval stages
    market_data = retrieve_market_data(query)
    documents = retrieve_documents(query)
    positions = retrieve_positions(user_context)

    # Log latency per source on the active trace
    trace = get_current_run_tree()
    trace.add_metadata({
        "market_data_latency_ms": market_data.latency,
        "documents_latency_ms": documents.latency,
        "total_retrieval_ms": (time.time() - start) * 1000
    })

    # Track source contributions
    trace.add_metadata({
        "sources_used": ["market_api", "vector_store", "structured_db"],
        "chunks_retrieved": len(documents.chunks)
    })

    return generate_response(market_data, documents, positions)

What we monitor in real-time:

  • p95/p99 latency per query type
  • Retrieval success rate (did we find relevant documents?)
  • Source distribution (which knowledge sources are being used?)
  • Token usage per query (cost tracking)
  • Error rate by category

Error Analysis: Finding the Needles

When something fails, we need to know why—immediately.

Error categorization pipeline:

  1. Runtime exceptions → Alert on-call engineer
  2. Empty retrieval → Flag for retrieval tuning
  3. Low confidence generation → Log for offline analysis
  4. User feedback (thumbs down) → Prioritize for review

The weekly review ritual:
Every Monday, we sample 50 failed queries and categorize them:

  • Retrieval missed relevant docs (30%)
  • Model misinterpreted intent (25%)
  • Missing data in sources (20%)
  • Model hallucinated (15%)
  • Other (10%)

This drives our improvement roadmap.

Query Categorization: Understanding Usage Patterns

We tag every query with multiple dimensions:

{
  "query": "What's our exposure to tech sector ETFs?",
  "intent": "risk_exposure",
  "asset_class": "equity",
  "time_sensitivity": "current",
  "complexity": "medium",
  "user_role": "portfolio_manager",
  "source_preference": "positions_first"
}

What this enables:

  • Usage analytics: Which user roles ask which questions?
  • Performance segmentation: Is latency higher for complex queries?
  • Retrieval optimization: Different query types need different source weighting
  • Training data generation: Real queries become evaluation examples
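The retrieval-optimization point can be made concrete with a weight table keyed by intent. The weights below are illustrative placeholders, not our production configuration.

```python
# Hypothetical mapping from query intent to retrieval source weights
SOURCE_WEIGHTS = {
    "risk_exposure": {"structured_db": 0.6, "market_api": 0.3, "vector_store": 0.1},
    "compliance": {"regulatory_corpus": 0.7, "vector_store": 0.2, "market_api": 0.1},
}
DEFAULT_WEIGHTS = {"vector_store": 0.5, "market_api": 0.3, "structured_db": 0.2}

def weights_for(tags: dict) -> dict:
    # Fall back to a neutral blend for intents we haven't profiled yet
    return SOURCE_WEIGHTS.get(tags.get("intent"), DEFAULT_WEIGHTS)

w = weights_for({"intent": "risk_exposure", "user_role": "portfolio_manager"})
```

A risk-exposure query leads with client positions, while a compliance query leads with the regulatory corpus, matching the source-ordering intuition described earlier.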

Continuous Model Improvement: The Feedback Loop

LangSmith's trace data becomes our training data. Here's the loop:

Step 1: Identify improvement opportunities

  • Queries with low confidence scores
  • Queries where users gave negative feedback
  • Queries that required human escalation

Step 2: Create evaluation datasets

# From production traces to test cases. The expected fields are filled
# in per query during review; one example's values are shown here.
evaluation_dataset = [
    {
        "query": failed_query,
        "expected_sources": ["sec_filings", "earnings_transcripts"],
        "expected_entities": ["AAPL", "Q3 2024"],
        "ideal_response_summary": "Revenue growth with segment breakdown"
    }
    for failed_query in last_week_failures
]

Step 3: Run offline evaluations

  • Test prompt variations
  • Test retrieval parameter changes
  • Test different chunking strategies

Step 4: A/B test in production

  • 5% traffic to new configuration
  • Compare metrics side-by-side
  • Roll out if improvements hold
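The traffic split can be done deterministically by hashing the client ID, so each client consistently sees the same configuration across requests. A minimal sketch, assuming per-client bucketing is acceptable:

```python
import hashlib

# Deterministic bucketing: hash the client ID into 0-99 and compare
# against the experiment's traffic share (5% as described above).
def in_experiment(client_id: str, percent: float = 5.0) -> bool:
    digest = hashlib.sha256(client_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return bucket < percent

assignments = {cid: in_experiment(cid) for cid in ("fund_a", "fund_b", "fund_c")}
```

Hash-based assignment avoids the flicker you get from random per-request sampling: a client never bounces between the old and new configuration mid-session.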

The Evaluation Framework That Keeps Us Honest

With great power comes great responsibility—especially in finance. Our evaluation framework has four layers:

Layer 1: Unit Tests for Deterministic Components

def test_date_extraction():
    query = "What was our P&L on March 15, 2024?"
    entities = extract_entities(query)
    assert entities["date"] == "2024-03-15"

def test_ticker_recognition():
    query = "How's BRK.B performing?"
    tickers = extract_tickers(query)
    assert "BRK.B" in tickers

Layer 2: Retrieval Quality Metrics

# Set-based precision/recall over document IDs (sketch). Libraries such
# as ragas provide LLM-graded context_precision/context_recall metrics;
# a simple ID-overlap version keeps this check fast and deterministic.
def evaluate_retrieval(test_case):
    retrieved_ids = {doc.id for doc in retrieve_documents(test_case.query)}
    expected_ids = set(test_case.expected_docs)

    relevant = retrieved_ids & expected_ids
    precision = len(relevant) / len(retrieved_ids) if retrieved_ids else 0.0
    recall = len(relevant) / len(expected_ids) if expected_ids else 0.0

    assert precision > 0.8, f"Precision {precision} below threshold"
    assert recall > 0.7, f"Recall {recall} below threshold"

Layer 3: Generation Quality (LLM-as-Judge)

FINANCIAL_EVALUATION_PROMPT = """
You are evaluating a financial assistant's response. Score each dimension 1-5:

Accuracy (1-5): Are all numerical claims correct? Are dates and entities right?
Completeness (1-5): Does it address all parts of the query?
Source Attribution (1-5): Are claims traceable to provided sources?
Risk Awareness (1-5): Does it appropriately qualify uncertain information?
Conciseness (1-5): Is it clear without unnecessary detail?

Query: {query}
Response: {response}
Sources: {sources}

Return JSON with scores and brief justification.
"""
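Wiring the judge prompt into a pass/fail gate looks roughly like this. The judge call is stubbed out here; in production it would be an LLM API call returning the JSON described in the prompt, and the threshold of 4 is an illustrative choice.

```python
import json

# Stub: a real implementation would send the filled-in evaluation
# prompt to an LLM and return its raw JSON reply.
def call_judge_model(prompt: str) -> str:
    return json.dumps({
        "accuracy": 5, "completeness": 4, "source_attribution": 5,
        "risk_awareness": 4, "conciseness": 5,
        "justification": "All figures match the cited filing.",
    })

def judge(query: str, response: str, sources: str, threshold: int = 4) -> dict:
    raw = call_judge_model(f"Query: {query}\nResponse: {response}\nSources: {sources}")
    scores = json.loads(raw)
    # Gate on the numeric dimensions only; justification stays a string
    numeric = {k: v for k, v in scores.items() if isinstance(v, int)}
    scores["passed"] = all(v >= threshold for v in numeric.values())
    return scores

verdict = judge("AAPL Q3 revenue?", "Revenue grew 8%...", "sec_filings")
```

Parsing into a structured verdict is what lets judge scores feed dashboards and regression gates rather than just free-text reviews.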

Layer 4: Safety and Compliance Checks

Financial responses have non-negotiable requirements:

def safety_check(response):
    checks = {
        "forward_looking_statements": not contains_speculative_future(response),
        "regulated_terms": not uses_restricted_phrases(response),
        "disclaimers_present": has_required_disclaimers(response),
        "hallucination_check": all_claims_sourced(response)
    }
    return all(checks.values())

The 92% Resolution Rate: What It Actually Means

Let's be precise about what "92% resolution rate" means in practice.

The breakdown of resolved queries:

  • Full resolution (78%): Query answered completely, no follow-up needed
  • Partial resolution (14%): Answer provided but user needed to clarify or ask follow-up
  • Escalation (5%): Handed to human after bot attempt
  • Failed (3%): Bot couldn't handle, direct to human

What drives improvements:

  • Each 1% gain in resolution required ~200 new evaluation cases
  • The hardest gains come from edge cases (unusual ticker formats, complex multi-part queries)
  • We track "resolution by category" to focus efforts

Lessons Learned: What We'd Do Differently

What Worked

Multi-source from day one

Building with multiple knowledge sources forced us to think about source selection and weighting early. Retrofitting would have been painful.

LangSmith instrumentation before launch

We had tracing from the first prototype. This meant when we launched, we immediately had baseline data.

User feedback as first-class metric

Thumbs up/down isn't just a nice-to-have—it's our most valuable signal. We treat every downvote as a bug report.

Sub-second obsession

Financial users won't wait. Optimizing for latency forced better architecture decisions.

What We'd Change

More evaluation data earlier

We started with 50 test cases. We needed 500. Build your evaluation dataset before you think you need it.

Stricter hallucination detection

Our initial monitoring missed the stale data incident. Now we check every numerical claim against source timestamps.
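The timestamp check that would have caught the stale-data incident is simple in principle; a minimal sketch, assuming each sourced claim carries a timestamp and each query type declares a maximum acceptable age:

```python
from datetime import datetime, timedelta, timezone

# A claim is stale when its source timestamp is older than the
# freshness the query demands (names and dates are illustrative).
def is_stale(source_ts: datetime, max_age: timedelta, now: datetime) -> bool:
    return now - source_ts > max_age

now = datetime(2024, 3, 18, tzinfo=timezone.utc)
claim_ts = datetime(2024, 3, 4, tzinfo=timezone.utc)  # two weeks old
# "as of last Friday's close" demands at-most-one-day-old data
stale = is_stale(claim_ts, timedelta(days=1), now)
```

The hard part in practice is propagating source timestamps through retrieval and generation so every numerical claim still knows where, and when, it came from.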

Earlier A/B testing infrastructure

We waited too long to implement A/B tests. Now we test every significant change against 5-10% of traffic.

Regulatory review integration

Compliance should be in the loop from day one, not after incidents.


The Road Ahead

Our evaluation framework evolves continuously. Current priorities:

Real-time hallucination detection

Using smaller models to verify each claim against sources before returning to user.

Multi-lingual expansion

Evaluating performance across languages without losing financial accuracy.

Reasoning transparency

Helping users understand why the bot answered the way it did, with visible chain-of-thought.

Automated test generation

Using production traces to automatically create new evaluation cases.


The Bottom Line

Building a production financial RAG system isn't about having the biggest model or the cleverest prompts. It's about:

  • Measuring everything (latency, accuracy, source usage, failure modes)
  • Learning systematically (every failure becomes an evaluation case)
  • Improving continuously (A/B tests, not guesswork)
  • Staying honest (knowing what you don't know)

Our 92% resolution rate isn't a finish line—it's a baseline. Every week we find new edge cases, new failure modes, new opportunities to improve.

And that's the real lesson: LLM evaluation isn't a one-time activity. It's the discipline of getting better every day.


Building a financial RAG system? I'd love to hear about your evaluation challenges. What metrics matter most to you? What's broken in surprising ways? Let me know.
