I Built a RAG System Over SEC Filings — Here's Every Design Decision I Made (And Why One-Size-Fits-All Retrieval Doesn't Work for Financial Data)
A deep dive into building a production RAG system for SEC filings — from the first naive prototype that hallucinated revenue numbers, through the domain-specific nightmares I didn't see coming, to the 5-route retrieval architecture that finally worked.
The Spark
It started with a simple question.
I was trying to compare Apple and Microsoft's revenue for 2023. Not a complicated question. Just: who made more money, and where did it come from?
Here's what that actually required:
- Go to SEC EDGAR
- Search for Apple's FY2023 10-K filing
- Scroll through 80+ pages of legal boilerplate
- Find the Consolidated Statements of Operations
- Extract the revenue number
- Scroll further to the segment footnotes — iPhone? Services? Mac?
- Now repeat the entire process for Microsoft — different filing, different fiscal year end (Apple ends in September, Microsoft ends in June), different segment structure
- Cross-check the numbers in the Management Discussion & Analysis section
- Make sure the numbers actually tie to the XBRL data
That took me about 40 minutes. For one question.
And I thought — this data is public. It's structured. It's digital. Why can't I just ask the question and get an answer?
So I built a system that does exactly that. You type "Compare revenue of AAPL vs MSFT 2023 and what contributed the most to revenue for each company" and in about 5 seconds you get:
| | Apple (AAPL) | Microsoft (MSFT) |
|---|---|---|
| Revenue | $383.29B | $211.91B |
| Top segment | iPhone — $200.58B (52%) | Cloud — $137.4B (65%) |
| #2 segment | Services — $85.20B (22%) | Office — $69.27B (33%) |
With source citations, a confidence score, and contradiction detection — all streamed in real time.
What the system covers
Before diving into the how — here's the scope:
- 10 companies: AAPL, MSFT, NVDA, AMZN, GOOGL, META, BRK-B, LLY, AVGO, JPM (top S&P 500 by market cap)
- 2010 to present — 15+ years of filing history per company
- Filing types: 10-K (annual) and 10-Q (quarterly)
- ~1.1M structured XBRL facts across annual_facts (~361K rows) and quarterly_facts (~734K rows)
- ~134K narrative chunks with embeddings across 10-K and 10-Q sections (Risk Factors, MD&A)
- ~1,800 full financial statements (income statement, balance sheet, cash flow) stored as markdown
- Database size: ~3 GB in PostgreSQL + pgvector
This post is the full story of how I got there. Every design decision, every dead end, every domain-specific surprise. If you're building RAG systems — especially over structured + unstructured data — I hope this saves you some pain.
Phase 1: The Naive Approach (and Why It Failed)
Like most people building RAG systems in 2024, I started with the obvious architecture:
- Download SEC filings
- Chunk the text
- Generate embeddings
- Store in a vector database
- On query: embed the question, find similar chunks, feed to LLM, get answer
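In code, that entire loop fits on one screen. Here's a toy sketch, with a stub token-count "embedding" standing in for a real embedding model and the final LLM call omitted:

```python
import math

def embed(text):
    # Stub: token counts. A real system would call an embedding model here.
    counts = {}
    for tok in text.lower().split():
        tok = tok.strip(".,?!'s")  # crude normalization
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank stored chunks by similarity to the query; the top-k chunks
    # become the context stuffed into the LLM prompt.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Our total net revenue was $383.3 billion, a decrease of 3 percent.",
    "The Risk Factors section describes competition and supply chain risks.",
]
top = retrieve("What was Apple's revenue in 2023?", chunks)
```

Simple, and for fuzzy qualitative questions it genuinely works.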
This is the "hello world" of RAG. Every tutorial teaches this pattern. And for some questions, it works.
"What are Meta's key risk factors?" — vector search finds the Risk Factors section, pulls relevant paragraphs, the LLM summarizes them. Good enough.
Then someone asks: "What was Apple's revenue in 2023?"
The system pulls a paragraph from the MD&A section that mentions revenue in passing: "...our total net revenue was $383.3 billion, a decrease of 3 percent year-over-year..." The LLM extracts the number. Sometimes it gets it right. Sometimes it grabs a quarterly number instead of annual. Sometimes it hallucinates entirely.
The problem is fundamental: I'm using semantic similarity to find an exact number. That's like using Google Maps to find your car keys. The tool isn't wrong — it's just the wrong tool for the job.
The actual revenue figure was sitting in structured XBRL data the entire time — a machine-readable tag that says us-gaap:Revenues = 383,285,000,000 USD. No ambiguity. No extraction needed. I just wasn't using it.
The Aha Moment
SEC filings contain two fundamentally different types of information:
| Type | Examples | How you should retrieve it |
|---|---|---|
| Structured data | Revenue, net income, EPS, total assets | SQL query over parsed XBRL facts |
| Unstructured narrative | Risk factors, management discussion, business strategy | Vector similarity search + reranking |
And many real questions need both at once: "Compare AAPL vs MSFT revenue and explain what drove the difference" needs the revenue numbers from XBRL and the management commentary from MD&A.
One retrieval pipeline can't do all of this well. That realization reshaped everything.
Phase 2: Building the Data Foundation
Before any RAG can work, you need data. And SEC filing data is... a journey.
Ingesting from SEC EDGAR
SEC EDGAR is the authoritative source for all public company filings in the US. It's also a product of the early 2000s web. The data quality issues I encountered:
- Inconsistent filing formats: Filings span decades of format changes. Older filings are HTML with embedded formatting. Newer ones use Inline XBRL. There's no single parsing strategy that works across all years.
- Missing XBRL tags: Some filings have incomplete XBRL tagging. A company might report revenue in their financial statements but not tag it properly, meaning programmatic extraction fails silently. You don't find out until a user asks a question and gets nothing back.
- Rate limiting: SEC EDGAR enforces strict rate limits (10 requests/second with a declared User-Agent). I use a 0.15-second delay between API calls, which means ingesting 15+ years of filings across 10 tickers takes hours.
- Filing amendments: Companies file amendments (10-K/A, 10-Q/A) that supersede original filings. Without deduplication by accession number, you get double-counted data.
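The rate-limit handling is mundane but mandatory. A minimal sketch of polite EDGAR fetching, combining the throttle with a declared User-Agent (the contact string and function names here are placeholders, not the production code):

```python
import time
import urllib.request

SEC_DELAY_SECONDS = 0.15  # stays under EDGAR's 10 requests/second limit
USER_AGENT = "Sample Research admin@example.com"  # placeholder; EDGAR wants a real contact

def build_request(url):
    # EDGAR rejects requests that don't declare a User-Agent.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

_last_call = [0.0]

def fetch_edgar(url):
    # Sleep out the remainder of the throttle window since the previous call.
    wait = SEC_DELAY_SECONDS - (time.monotonic() - _last_call[0])
    if wait > 0:
        time.sleep(wait)
    _last_call[0] = time.monotonic()
    with urllib.request.urlopen(build_request(url)) as resp:
        return resp.read()
```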
The Custom Parser Disaster
My first attempt was a custom XBRL parser. It read SEC EDGAR's JSON endpoints and extracted financial facts. It worked for simple filings but quickly fell apart:
The period classification problem: A single 10-K filing contains annual facts, quarterly comparatives, and year-to-date (YTD) aggregations, all interleaved. My parser couldn't reliably sort these into the correct buckets. A "FY2023 revenue" query might return the YTD-9-month figure instead of the full-year figure. Same concept, same filing, different time period — and the wrong one.
I solved this with duration-based classification:
def classify_fact_period(fact):
    # Classify by reported period length, not by the filing's own labels.
    duration_days = (fact.end_date - fact.start_date).days
    if 70 <= duration_days <= 120:     # ~3 months → quarterly
        return "quarterly", infer_quarter(fact)
    elif 330 <= duration_days <= 420:  # ~12 months → annual
        return "annual", None
    else:                              # 6-month, 9-month YTD → discard
        return None, None              # deliberately excluded
Facts with durations between 120 and 330 days (YTD periods) are dropped entirely. This single filter eliminated an entire class of silent data corruption.
The missing data problem: The parser would silently drop facts it couldn't classify, leading to gaps that only surfaced when users asked questions.
I eventually switched to edgartools, an open-source library built specifically for SEC EDGAR. It handled period classification, quarterly/annual bucketing, and section extraction out of the box. Not a silver bullet — it still struggled with certain filings — but it cut my ingestion bugs by 80%.
The Database Decision
I needed a database that could handle both structured financial data (XBRL facts) and vector embeddings for narrative search. The options:
| Option | Pros | Cons |
|---|---|---|
| Pinecone + PostgreSQL | Best-in-class vector search | Two databases, cross-system joins, higher cost |
| SQLite + FAISS | Simple, local | No concurrent access, no full-text search, hard to deploy |
| PostgreSQL + pgvector | Single database for everything | Vector search not as fast as dedicated solutions |
I went with PostgreSQL + pgvector. The deciding factor: ~60% of queries in my system are relational (XBRL lookups, timeseries), not vector search. Running two databases for the 40% that needs embeddings wasn't worth the operational complexity.
Plus, at ~134K embeddings, pgvector's IVFFlat indexes are perfectly adequate. This isn't a million-document corpus — it's a focused, well-structured dataset.
Schema Design: 8 Tables, Each with a Purpose
filings → Central metadata (accession numbers, dates)
annual_facts → XBRL from 10-K filings (~361K rows)
quarterly_facts → XBRL from 10-Q filings (~734K rows)
sections_10k → 10-K narrative chunks + embeddings (~42K rows)
sections_10q → 10-Q narrative chunks + embeddings (~92K rows)
financial_documents → Full statements as markdown (~1.8K rows)
filing_sections → Raw extracted section text
earnings_reports → 8-K structured earnings data
Why separate annual and quarterly tables? Performance. annual_facts has 361K rows and quarterly_facts has 734K rows. Annual queries (the majority) would waste time scanning quarterly rows behind a WHERE fiscal_quarter IS NULL filter. Separate tables with dedicated indexes give cleaner query plans.
Why separate 10-K and 10-Q vector tables? The section structures are fundamentally different (10-K has Items 1-15, 10-Q has Parts I-II with different items), they have different temporal columns, and different uniqueness constraints.
One design choice I'm particularly happy with: partial indexes for consolidated data:
CREATE INDEX idx_annual_consolidated
ON annual_facts(ticker, concept, fiscal_year)
INCLUDE (value, unit)
WHERE dimension IS NULL;
XBRL reports the same metric multiple times — consolidated total, by geographic segment, by product line. Without filtering, a revenue query returns 5-10 values instead of one. This partial index enforces "consolidated only" at the index level and enables index-only scans for the most common query pattern.
Phase 3: The Domain-Specific Nightmares
This is where the project got really interesting. And by "interesting," I mean "spent weeks debugging things that no tutorial prepared me for."
Nightmare #1: Fiscal Years Don't Follow the Calendar
Only 4 of my 10 companies use a December fiscal year-end. The others:
| Ticker | FY End | What "2024" Actually Means |
|---|---|---|
| AAPL | September | Oct 2023 – Sep 2024 |
| MSFT | June | Jul 2023 – Jun 2024 |
| NVDA | January | Feb 2024 – Jan 2025 |
| AVGO | October | Nov 2023 – Oct 2024 |
When a user asks "What was NVIDIA's revenue in 2024?", they probably mean calendar year 2024. But in NVIDIA's filing terminology, that's FY2025 (ending January 2025). If you query WHERE fiscal_year = 2024, you get FY2024 — which is mostly calendar year 2023.
I built a calendar_to_fiscal_year() mapping, but the tricky part is that the LLM classifier doesn't always know about these mappings. The system prompt tells it about fiscal year conventions, but the real safeguard is the programmatic mapping that runs regardless of what the LLM outputs.
This took me longer to debug than I'd like to admit. I kept getting NVIDIA numbers that were "off by one year" and couldn't figure out why until I sat down with a calendar and traced through the fiscal year logic manually.
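For illustration, here is what a calendar_to_fiscal_year() mapping can look like. The cutoff rule below (fiscal years ending January through May get labeled with the following calendar year) is my assumption for the sketch, not necessarily the production logic:

```python
# Fiscal year-end month per ticker (December for the companies not listed).
FISCAL_YEAR_END_MONTH = {"AAPL": 9, "MSFT": 6, "NVDA": 1, "AVGO": 10}

def calendar_to_fiscal_year(ticker, calendar_year):
    end_month = FISCAL_YEAR_END_MONTH.get(ticker, 12)
    # Assumed cutoff: a fiscal year ending Jan-May is labeled with the
    # following year, so NVDA's year covering most of calendar 2024
    # (Feb 2024 - Jan 2025) is FY2025.
    return calendar_year + 1 if end_month < 6 else calendar_year
```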
Nightmare #2: 52/53-Week Fiscal Calendars
Apple and Broadcom don't use exact month-end dates. They use 52/53-week fiscal calendars, which means their quarter endings can spill 1-7 days into the next calendar month.
Apple's Q1 FY2024 ended on December 30, 2023 — not December 31. But their Q4 FY2023 ended on September 30, 2023. And sometimes a quarter ends on October 1 instead of September 30.
This created quarter-misclassification bugs: a period ending January 2 would be classified as Q2 instead of Q1 because the raw month was January.
The fix is a humble 2-line heuristic:
def infer_fiscal_quarter(ticker, filing):
    period = filing.period_of_report
    month = period.month
    # Handle 52/53-week calendars: if the period ends in the first 7 days
    # of a month, roll back to the previous month
    if period.day <= 7:
        month = 12 if month == 1 else month - 1
    # ...month then maps to Q1-Q4 via the ticker's fiscal calendar
This correctly handles the spillover that affects ~5% of Apple's and Broadcom's filings. Two lines of code, weeks of debugging to discover I needed them.
Nightmare #3: Q4 Doesn't Exist
This one genuinely surprised me. The SEC does not require companies to file a separate 10-Q for Q4. Q4 data is embedded in the annual 10-K filing. So if you query quarterly_facts WHERE fiscal_quarter = 4, you often get nothing.
I needed three complementary approaches:
- Arithmetic derivation: Q4 = Annual - Q1 - Q2 - Q3. Simple, but requires all three quarterly filings to be present.
- Income statement vs. balance sheet distinction: This is the subtle one. Revenue is a flow metric — it accumulates over time, so Q4 = Annual minus the first three quarters. But total assets is a stock metric — it's a point-in-time snapshot. For balance sheet items, Q4 = the annual period-end value, not a subtraction.
# Income Statement (flow): Q4 = Annual - Q1 - Q2 - Q3
# Balance Sheet (stock): Q4 = Annual value (point-in-time)
# Cash Flow (flow): Q4 = Annual - Q3 YTD
- edgartools TTM calculator: For edge cases where the arithmetic approach fails (missing quarters), the library has built-in quarterization logic.
I didn't find any of this in a tutorial. I found it by getting wrong Q4 numbers and tracing through the data until I understood why.
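The flow-vs-stock derivation above fits in a few lines. A sketch (function name and concept sets are illustrative, not the production code):

```python
# Flow metrics accumulate across the year; stock metrics are point-in-time.
FLOW_CONCEPTS = {"us-gaap:Revenues", "us-gaap:NetIncomeLoss"}      # income statement
STOCK_CONCEPTS = {"us-gaap:Assets", "us-gaap:StockholdersEquity"}  # balance sheet

def derive_q4(concept, annual, q1, q2, q3):
    if concept in STOCK_CONCEPTS:
        return annual             # Q4 balance = the fiscal year-end snapshot
    if None in (q1, q2, q3):
        return None               # missing quarter: fall back to library quarterization
    return annual - q1 - q2 - q3  # Q4 flow = annual minus the first three quarters
```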
Nightmare #4: XBRL Tags Get Renamed
The XBRL taxonomy evolves. What was us-gaap:SalesRevenueNet in 2015 became us-gaap:Revenues in 2018 and then us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax (yes, that's one tag name) after ASC 606 adoption.
A timeseries query for "Apple's revenue 2015-2024" must search across all three concept names. If you only search for us-gaap:Revenues, you'll get data for 2018-2020 but miss everything before and after.
My solution is a concept alias system:
CONCEPT_ALIASES = {
"revenue": [
"us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
"us-gaap:Revenues",
"us-gaap:SalesRevenueNet",
"us-gaap:SalesRevenueGoodsNet",
],
}
The retrieval functions try each alias in order and take the first match per (ticker, year) pair. The get_metric_timeseries() function uses DISTINCT ON (fiscal_year, fiscal_quarter) with ordering by end_date DESC to pick the most relevant value when multiple concepts match.
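A sketch of that try-each-alias-in-order behavior, with the database call abstracted behind a lookup callable so the example stays self-contained (names and toy data are illustrative):

```python
REVENUE_ALIASES = [
    "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
    "us-gaap:Revenues",
    "us-gaap:SalesRevenueNet",
]

def first_matching_value(lookup, ticker, fiscal_year, aliases):
    # lookup(ticker, concept, year) -> value or None; in production this
    # would be the SQL query against annual_facts with dimension IS NULL.
    for concept in aliases:
        value = lookup(ticker, concept, fiscal_year)
        if value is not None:
            return concept, value
    return None, None

# Toy data: MSFT only tags the post-ASC-606 concept.
facts = {
    ("MSFT", "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax", 2023):
        211_915_000_000,
}
concept, value = first_matching_value(
    lambda t, c, y: facts.get((t, c, y)), "MSFT", 2023, REVENUE_ALIASES
)
```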
But here's the really sneaky part — I discovered this bug just recently. The LLM classifier would return xbrl_concepts: ["us-gaap:Revenues"] for a multi-company query. NVIDIA uses us-gaap:Revenues, so it worked fine. But Microsoft uses us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax. The single concept from the classifier found NVIDIA's data but missed Microsoft's entirely, causing Microsoft to fall back to raw income statement text — which the LLM then misinterpreted, producing wrong numbers.
The fix was to expand the classifier's single concept with the full alias list:
# After resolving the classifier's concept, expand with all known aliases
for term in resolved.keys():
if term.lower() in CONCEPT_ALIASES:
existing = set(resolved[term])
for alias in CONCEPT_ALIASES[term.lower()]:
if alias not in existing:
resolved[term].append(alias)
This was a production bug that produced confident, wrong answers. The kind that erodes trust. And it came from the interaction between two correct components — the classifier was right to suggest us-gaap:Revenues, and the retrieval was right to use whatever concept it was given. The bug was in the assumption that one concept name works across all companies.
Nightmare #5: Consolidated vs. Segment Data
XBRL filings report the same metric multiple times with different dimensional breakdowns. Apple's 10-K contains:
- Total revenue: $383B (consolidated)
- iPhone revenue: $200B (product segment)
- Services revenue: $85B (product segment)
- Americas revenue: $162B (geographic segment)
- Europe revenue: $94B (geographic segment)
All tagged as "revenue" variants in XBRL. Without filtering, a revenue query returns all of these. The LLM then has to figure out which one is the "real" revenue, and it often picks wrong.
The default behavior filters on WHERE dimension IS NULL (consolidated only). Segment data is opt-in via an include_segments=True flag, triggered when the query contains keywords like "by segment," "by product," "by region."
Phase 4: The 5-Route Retrieval Architecture
After hitting all these domain problems, the architecture crystallized into something I couldn't have designed upfront. Each route exists because a specific class of questions couldn't be served well by any other route.
Route 1: metric_lookup — Direct XBRL Queries
For: "What was Apple's revenue in 2023?"
This is the workhorse. Direct SQL query against annual_facts or quarterly_facts. No embeddings, no LLM retrieval — just a database lookup.
Classify → Extract ticker, year, concept
→ SQL: SELECT value FROM annual_facts
WHERE ticker='AAPL' AND concept='us-gaap:Revenues'
AND fiscal_year=2023 AND dimension IS NULL
→ Compute YoY comparison
→ Format and generate answer
Response time: sub-100ms for retrieval. Total with LLM: ~3 seconds.
Route 2: timeseries — Multi-Year Trends
For: "Show NVIDIA revenue from 2020 to 2024"
Uses get_metric_timeseries() which handles concept alias merging across years. Returns all data points chronologically, with derived Q4 values where needed.
Route 3: full_statement — Complete Financial Tables
For: "Show me JPMorgan's balance sheet for 2023"
Fetches pre-formatted markdown tables from financial_documents. These are stored as-is during ingestion — rendering a financial table is better done once during ingestion than on every query.
Route 4: narrative — Semantic Search
For: "What are Meta's key risk factors?"
This is the classic RAG pipeline: embed the query, search pgvector, rerank with a cross-encoder, return the top chunks.
But with important enhancements:
- Section-aware chunking: Instead of naive 512-token chunks, the system detects subsection headings within the Risk Factors and MD&A sections, treating each subsection as a chunk candidate. Split points snap to sentence boundaries, never mid-sentence.
- Cross-encoder reranking: The initial pgvector search returns ~100 candidates. These are reranked using cross-encoder/ms-marco-MiniLM-L-6-v2, which scores each (query, chunk) pair directly. This is critical for financial text — a query about "revenue growth drivers" has high cosine similarity with "revenue recognition policies" (shares the word "revenue") but low actual relevance.
- Risk-specific sub-queries: For risk analysis queries, the system generates additional search queries targeting specific risk categories (litigation, debt, competition, cybersecurity) to cast a wider net.
Route 5: hybrid — Structured + Narrative
For: "Compare AAPL vs MSFT revenue and explain what drove the difference"
This is the most complex route. It combines relational retrieval (XBRL numbers) with narrative retrieval (MD&A text) and presents both to the LLM.
Classify → Determine retrieval_intent
→ If "timeseries": retrieve_timeseries()
→ If "full_statement": retrieve_full_statement()
→ Else: retrieve_metric_lookup()
→ ALSO: retrieve_narrative() for context
→ Merge relational + vector results
→ Format with separate "Structured Data" and "Narrative" sections
→ Generate answer with instructions to use numbers from XBRL
The key insight: the LLM is explicitly told to use XBRL data for hard numbers and narrative sections for qualitative context. Without this instruction, the LLM would sometimes extract numbers from narrative paragraphs instead of the authoritative XBRL data — producing slightly different values due to rounding in the prose.
The Query Classifier
Routing is done by GPT-4o-mini with function calling. The function schema extracts 15+ fields from a natural language query:
- Route (metric_lookup, timeseries, full_statement, narrative, hybrid)
- Tickers
- Fiscal year(s)
- XBRL concepts
- Retrieval intent
- Temporal granularity (annual/quarterly/specific_quarter)
- Target sections
- Search queries for vector retrieval
Why function calling instead of prompt-based classification? Structured output. A prompt-based approach ("respond with JSON") frequently produced malformed output for complex queries. Function calling returns typed fields — no parsing failures.
Why GPT-4o-mini instead of GPT-4o? Classification is pattern matching, not deep reasoning. GPT-4o-mini handles it at 10x lower cost with comparable accuracy. The savings compound — classification runs on every query.
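For illustration, a trimmed version of what such a function schema can look like, passed via the tools parameter of the Chat Completions API. The field names here are representative, not the exact production schema:

```python
# Passed as tools=[{"type": "function", "function": CLASSIFY_QUERY_SCHEMA}]
# with tool_choice forcing this function, so the model must return typed fields.
CLASSIFY_QUERY_SCHEMA = {
    "name": "classify_query",
    "description": "Route a financial question and extract retrieval metadata.",
    "parameters": {
        "type": "object",
        "properties": {
            "route": {
                "type": "string",
                "enum": ["metric_lookup", "timeseries", "full_statement",
                         "narrative", "hybrid"],
            },
            "tickers": {"type": "array", "items": {"type": "string"}},
            "fiscal_years": {"type": "array", "items": {"type": "integer"}},
            "xbrl_concepts": {"type": "array", "items": {"type": "string"}},
            "temporal_granularity": {
                "type": "string",
                "enum": ["annual", "quarterly", "specific_quarter"],
            },
        },
        "required": ["route", "tickers"],
    },
}
```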
Post-Classification Safety Nets
The classifier is right ~95% of the time. The other 5% cause the worst user experiences. Two heuristics catch the common misroutes:
Risk queries misrouted to metric_lookup: When the LLM sees "Apple" and defaults to financial data, even though the query is about risk factors:
if route in ("metric_lookup", "timeseries") \
and re.search(r'\brisk factors?\b', query, re.IGNORECASE) \
and not classification.get("xbrl_concepts"):
classification["route"] = "narrative"
Quantitative queries misrouted to narrative: When the LLM routes to narrative but its own metadata includes XBRL concepts and a quantitative intent:
if route == "narrative" \
and classification.get("xbrl_concepts") \
and intent in ("specific_metric", "comparison"):
classification["route"] = "metric_lookup"
These are band-aids. The LLM should get these right. But in practice, having deterministic safety nets for known failure patterns is cheaper than perfect classification.
Phase 5: Making Answers Trustworthy
Getting the right data is half the battle. The other half is making the LLM produce accurate, verifiable answers from that data.
Context Formatting: Route-Specific
The context string sent to the LLM is formatted differently for each route:
Relational routes get structured, labeled data:
## XBRL Financial Facts
- **us-gaap:Revenues** (2023): 383.29 billion USD
[XBRL: us-gaap:Revenues | AAPL, FY 2023]
## Year-over-Year Comparison
- AAPL revenue FY2022→FY2023: 394.33B → 383.29B (-2.8%)
Narrative routes get section chunks with metadata headers:
### Chunk 1 [10-K] META FY2025 - Risk Factors (score: 1.780)
[Source: 10-K, Ticker: META, Year: 2025, Section: Risk Factors]
Certain factors may have a material adverse effect on our business...
Hybrid routes get both, separated by headers. The system prompt explicitly says: "Use XBRL data for top-line metrics. Use narrative sections for qualitative breakdowns."
The Financial Snapshot
For risk analysis and comparative queries, a "financial snapshot" is prepended to the context:
## META FY2025 — Financial Snapshot
- Revenue: $200.97B
- Net Income: $60.46B
- Total Assets: $366.02B
- Long-Term Debt: $58.74B
- Cash & Equivalents: $35.87B
This gives the LLM quantitative grounding when discussing risks. Without it, the LLM produces generic risk factor summaries. With it, the LLM can say "Meta's $58.74B in long-term debt creates refinancing risk..." — much more useful.
A fun bug I discovered here: the snapshot was accidentally pulling from the quarterly_facts table (default) instead of annual_facts. So NVIDIA's revenue showed as $2.22B (Q1 value) instead of $10.92B (annual total). The LLM confidently used the wrong number because it was labeled "Revenue" in the context. Changed one parameter — table="annual_facts" — and the numbers were correct. One parameter. Hours of debugging.
Contradiction Detection
This is one of my favorite features. The system cross-references narrative claims against XBRL data to catch inconsistencies.
If a narrative chunk says "revenue increased" but the XBRL YoY comparison shows a decrease greater than 2%, this is flagged:
if narrative_direction == "increase" and data_direction == "decrease":
contradictions.append({
"type": "direction_mismatch",
"severity": "high" if abs(pct_change) > 10 else "medium",
"detail": f"Narrative says revenue increased, but data shows {pct_change:+.1f}%",
})
It also catches magnitude discrepancies — when the narrative claims "grew approximately 20%" but the actual XBRL data shows 5.3% growth.
Why does this matter? SEC filing narratives are written by management. They can contain optimistic framing. "Revenue remained strong" might accompany a 3% decline. Cross-referencing against hard numbers catches this.
All the keywords, thresholds, and watched concepts are configured in guardrails.yaml — no hardcoded values. The direction_tolerance_pct is 2% (small changes aren't contradictions), and the magnitude_tolerance_ppt is 5 percentage points (narrative approximations get some slack).
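The magnitude check can be sketched the same way as the direction check shown above (function name and exact severity cutoffs are illustrative):

```python
MAGNITUDE_TOLERANCE_PPT = 5.0  # from guardrails.yaml: narrative approximations get slack

def check_magnitude(claimed_pct, actual_pct):
    # Flag when the narrative's claimed growth differs from the XBRL-derived
    # figure by more than the tolerance, measured in percentage points.
    gap = abs(claimed_pct - actual_pct)
    if gap > MAGNITUDE_TOLERANCE_PPT:
        return {
            "type": "magnitude_mismatch",
            "severity": "high" if gap > 2 * MAGNITUDE_TOLERANCE_PPT else "medium",
            "detail": f"Narrative claims ~{claimed_pct:.0f}%, data shows {actual_pct:+.1f}%",
        }
    return None
```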
Confidence Scoring
Every answer gets a confidence score (0-100) from 5 weighted signals:
| Signal | What It Measures | Weight |
|---|---|---|
| Retrieval Quality | Source reliability (XBRL=90, statements=72, narrative=varies by rerank score) | 25-40% |
| Source Coverage | What fraction of requested (ticker, year, concept) tuples had data | 20-25% |
| Cross-Source Agreement | Contradiction penalties (-25 per high severity, -15 per medium) | 15-20% |
| Citation Density | Citations per sentence vs. target (0.5) | 10-15% |
| Data Recency | Staleness penalty (-20 per year gap between requested and available data) | 10-15% |
The weights are route-specific: relational routes weight retrieval quality at 40% (high confidence in XBRL data), narrative routes weight it at 25% (inherently less precise), timeseries routes weight source coverage at 35% (incomplete data is the primary failure mode).
Tiers: High (75+, green), Moderate (50-74, yellow), Low (25-49, red).
Everything — weights, thresholds, tier boundaries, keywords — lives in guardrails.yaml. This was one of the best architectural decisions. Tuning confidence scoring doesn't require code changes. Just edit the YAML.
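The scoring itself reduces to a weighted average of the five signals. A sketch, with weight values echoing the relational-route column of the table above (not the exact production config):

```python
# Route-specific weights loaded from guardrails.yaml in production.
RELATIONAL_WEIGHTS = {
    "retrieval_quality": 0.40, "source_coverage": 0.20,
    "cross_source_agreement": 0.15, "citation_density": 0.15,
    "data_recency": 0.10,
}

def confidence_score(signals, weights):
    # Each signal is 0-100; the score is their weighted average, clamped.
    score = sum(weights[name] * signals[name] for name in weights)
    return max(0.0, min(100.0, score))

def tier(score):
    return "high" if score >= 75 else "moderate" if score >= 50 else "low"
```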
Citation System
Every factual claim in the answer includes an inline citation:
[Source: 10-K, Ticker: AAPL, Year: 2024, Section: Income Statement]
Post-generation, the API server enriches these with actual SEC EDGAR URLs by looking up accession numbers from the filings table. The frontend renders clickable links directly to the filing on sec.gov.
This sounds like a small detail, but it's the difference between "trust me" and "here's the source."
Phase 6: Multi-Company Queries (Harder Than You'd Think)
Comparing two companies isn't "run the single-company pipeline twice." I learned this the hard way.
Reranking Scores Aren't Comparable Across Tickers
When comparing Apple and Microsoft, narrative retrieval runs separate vector searches for each company. The cross-encoder reranking scores are query-dependent — Apple chunks are scored against the comparison query, and so are Microsoft chunks.
But these scores are not comparable across companies. One company's filing text might consistently score higher simply because of writing style or keyword density. Without intervention, Apple's chunks could dominate the results even if the user asked for an equal comparison.
The fix: per-ticker fair allocation in guardrails:
if is_multi_ticker:
per_ticker_max = max(max_chunks // len(tickers), 3)
for ticker, ticker_chunks in group_by_ticker(chunks).items():
ticker_chunks.sort(key=lambda c: c["rerank_score"], reverse=True)
kept.extend(ticker_chunks[:per_ticker_max])
Each company gets an equal share of the chunk budget, sorted by rerank score within each group.
Query Decomposition
Multi-ticker queries are decomposed into per-ticker sub-queries:
Parent: "Compare AAPL and MSFT revenue growth 2020-2024"
→ Sub-query 1: "AAPL revenue data 2020-2024"
→ Sub-query 2: "MSFT revenue data 2020-2024"
Sub-queries execute in parallel using ThreadPoolExecutor(max_workers=4). This was a significant performance improvement — sequential execution added 5-10 seconds of latency.
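The fan-out itself is a few lines with concurrent.futures (function names are illustrative; the retrieval work is I/O-bound, so threads suffice):

```python
from concurrent.futures import ThreadPoolExecutor

def run_subqueries(tickers, retrieve_fn):
    # One retrieval task per ticker, executed concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {t: pool.submit(retrieve_fn, t) for t in tickers}
        return {t: f.result() for t, f in futures.items()}
```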
Fiscal Year Differences in the Answer
Apple's FY2023 ends in September 2023. Microsoft's FY2023 ends in June 2023. When the answer says "FY2023 revenue," these cover different calendar periods. The system prompt explicitly instructs the LLM to note fiscal year differences when comparing companies.
Phase 7: Caching, Cost, and Deployment
Three-Layer Cache
| Layer | TTL | What's Cached |
|---|---|---|
| Query Results | 1 hour | Full answer + sources + confidence |
| Classification | 2 hours | Route + extracted metadata |
| Retrieval | 1 hour | Raw retrieval data per (route + tickers + years + concepts) |
The classification cache has a longer TTL because classification is deterministic for the same query. The retrieval cache keys are hashed from a composite of all retrieval parameters, so "Apple's revenue in 2024" and "AAPL revenue FY2024" — different query strings but identical retrieval parameters — share the same cache entry.
Everything uses Redis with graceful degradation — if Redis is unavailable, the system works without caching. No crashes, just slower.
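A sketch of how such a parameter-derived cache key can be built, so that differently phrased queries with identical retrieval parameters collide on purpose (function name is illustrative):

```python
import hashlib
import json

def retrieval_cache_key(route, tickers, years, concepts):
    # Sort everything so "Apple's revenue in 2024" and "AAPL revenue FY2024"
    # hash identically once the classifier normalizes them to the same params.
    payload = json.dumps({
        "route": route,
        "tickers": sorted(tickers),
        "years": sorted(years),
        "concepts": sorted(concepts),
    }, sort_keys=True)
    return "retrieval:" + hashlib.sha256(payload.encode()).hexdigest()
```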
Cost Tracking
Every query tracks its OpenAI API cost across all phases. A typical query costs $0.0003-$0.0007 — well under a penny. The frontend displays a cost card with per-phase breakdown, total cost, token counts, and an efficiency grade (S through C).
I added this for myself during development, but users love it. It builds trust — they can see that a simple metric lookup costs almost nothing, while a complex hybrid comparison costs a fraction of a cent more.
Deployment: Railway + Vercel
- Vercel (frontend): Free tier for React, global CDN, auto-deploy from GitHub
- Railway (backend + database): $5/month Hobby plan for FastAPI + PostgreSQL
The database was the interesting challenge. Railway's managed PostgreSQL doesn't include pgvector. I deployed a custom Docker image (pgvector/pgvector:pg17) as a Railway service. This required:
- Custom volume mount (PGDATA=/data/postgres) because Railway's default creates a lost+found directory that conflicts with PostgreSQL's data directory
- Manual TCP proxy for external access during migration
- Reducing IVFFlat index lists from 100 to 50 because Railway's 64MB maintenance_work_mem wasn't enough for the larger index build
Data migration (3 GB) from local to Railway: pg_dump → pg_restore, with a version-matching headache (local pg17 vs system pg_dump v14) that took longer to debug than the actual migration.
Lessons Learned
1. Domain complexity dominates technical complexity. The hardest problems weren't about vector search or LLM prompting — they were about fiscal year conventions, XBRL taxonomy evolution, and Q4 derivation. A naive RAG system that ignores domain details produces confidently wrong answers.
2. Structured data should stay structured. I tried embedding financial tables early on. It was worse in every dimension: higher storage cost, slower retrieval, lower answer quality. Financial numbers should be queried relationally. Only narrative text benefits from semantic search.
3. LLM classification needs deterministic safety nets. 95% accuracy sounds high, but the 5% of misrouted queries cause the worst user experiences. Rule-based overrides for known failure patterns are cheap insurance.
4. Multi-entity queries are fundamentally different from single-entity. Reranking scores aren't comparable across tickers. Chunk budgets need fair allocation. Answer prompts need fiscal year difference warnings. You can't just loop.
5. One parameter can ruin everything. Using quarterly_facts instead of annual_facts for the financial snapshot (a default parameter I didn't override) produced wrong numbers that the LLM confidently presented. The data was technically correct — it was a Q1 revenue number. But it wasn't the right Q1 revenue number for an annual snapshot.
6. Config-driven guardrails enable iteration. Moving all thresholds, weights, and keywords into guardrails.yaml was one of the best decisions. Tuning confidence scoring or contradiction detection doesn't require code changes — just a YAML edit and a deploy.
Try It / Read the Code
Live demo: sec-intelligence-system.vercel.app
GitHub: github.com/bhattaraisubal-eng/sec-intelligence-system
Coverage: AAPL, MSFT, NVDA, AMZN, GOOGL, META, BRK-B, LLY, AVGO, JPM — 10-K and 10-Q filings from 2010 to present.
Stack: FastAPI, PostgreSQL + pgvector, OpenAI text-embedding-3-small, GPT-4o-mini, cross-encoder/ms-marco-MiniLM-L-6-v2, React + Tailwind CSS
Some queries to try:
- "What was Apple's revenue in 2023?"
- "How has NVIDIA revenue changed from 2020 to 2024?"
- "Compare net income AAPL vs MSFT 2023"
- "What are the key risk factors in Meta's latest 10-K?"
- "Show JPMorgan balance sheet for 2023"
If you're building RAG systems over domain-specific data — especially data that mixes structured and unstructured information — I'd love to hear about your experience. The biggest thing I learned is that the domain shapes the architecture far more than the AI tools you choose.
If you found this useful, I'm happy to answer questions in the comments about any specific design decision. The detailed architecture docs in the repo go even deeper into each component.