<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Subal B</title>
    <description>The latest articles on DEV Community by Subal B (@bhattaraisubaleng).</description>
    <link>https://dev.to/bhattaraisubaleng</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3782444%2Fdbaebb53-b2e3-4ca4-83bb-cd9d1af14596.jpeg</url>
      <title>DEV Community: Subal B</title>
      <link>https://dev.to/bhattaraisubaleng</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bhattaraisubaleng"/>
    <language>en</language>
    <item>
      <title>Hybrid RAG System over SEC Filings</title>
      <dc:creator>Subal B</dc:creator>
      <pubDate>Fri, 20 Feb 2026 16:41:56 +0000</pubDate>
      <link>https://dev.to/bhattaraisubaleng/hybrid-rag-system-over-sec-filings-2gj8</link>
      <guid>https://dev.to/bhattaraisubaleng/hybrid-rag-system-over-sec-filings-2gj8</guid>
      <description>&lt;h2&gt;
  
  
  I Built a RAG System Over SEC Filings — Here's Every Design Decision I Made (And Why One-Size-Fits-All Retrieval Doesn't Work for Financial Data)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A deep dive into building a production RAG system for SEC filings — from the first naive prototype that hallucinated revenue numbers, through the domain-specific nightmares I didn't see coming, to the 5-route retrieval architecture that finally worked.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Spark
&lt;/h2&gt;

&lt;p&gt;It started with a simple question.&lt;/p&gt;

&lt;p&gt;I was trying to compare Apple and Microsoft's revenue for 2023. Not a complicated question. Just: who made more money, and where did it come from?&lt;/p&gt;

&lt;p&gt;Here's what that actually required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to SEC EDGAR&lt;/li&gt;
&lt;li&gt;Search for Apple's FY2023 10-K filing&lt;/li&gt;
&lt;li&gt;Scroll through 80+ pages of legal boilerplate&lt;/li&gt;
&lt;li&gt;Find the Consolidated Statements of Operations&lt;/li&gt;
&lt;li&gt;Extract the revenue number&lt;/li&gt;
&lt;li&gt;Scroll further to the segment footnotes — iPhone? Services? Mac?&lt;/li&gt;
&lt;li&gt;Now repeat the entire process for Microsoft — different filing, different fiscal year end (Apple ends in September, Microsoft ends in June), different segment structure&lt;/li&gt;
&lt;li&gt;Cross-check the numbers in the Management Discussion &amp;amp; Analysis section&lt;/li&gt;
&lt;li&gt;Make sure the numbers actually tie to the XBRL data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That took me about 40 minutes. For &lt;strong&gt;one question&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And I thought — this data is public. It's structured. It's digital. Why can't I just ask the question and get an answer?&lt;/p&gt;

&lt;p&gt;So I built a system that does exactly that. You type "Compare revenue of AAPL vs MSFT 2023 and what contributed the most to revenue for each company" and in about 5 seconds you get:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Apple (AAPL)&lt;/th&gt;
&lt;th&gt;Microsoft (MSFT)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Revenue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$383.29B&lt;/td&gt;
&lt;td&gt;$211.91B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top segment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;iPhone — $200.58B (52%)&lt;/td&gt;
&lt;td&gt;Cloud — $137.4B (65%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;#2 segment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Services — $85.20B (22%)&lt;/td&gt;
&lt;td&gt;Office — $69.27B (33%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With source citations, a confidence score, and contradiction detection — all streamed in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the system covers
&lt;/h3&gt;

&lt;p&gt;Before diving into the how — here's the scope:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10 companies&lt;/strong&gt;: AAPL, MSFT, NVDA, AMZN, GOOGL, META, BRK-B, LLY, AVGO, JPM (top S&amp;amp;P 500 by market cap)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2010 to present&lt;/strong&gt; — 15+ years of filing history per company&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filing types&lt;/strong&gt;: 10-K (annual) and 10-Q (quarterly)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~1.1M structured XBRL facts&lt;/strong&gt; across &lt;code&gt;annual_facts&lt;/code&gt; (361K rows) and &lt;code&gt;quarterly_facts&lt;/code&gt; (734K rows)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~134K narrative chunks&lt;/strong&gt; with embeddings across 10-K and 10-Q sections (Risk Factors, MD&amp;amp;A)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~1,800 full financial statements&lt;/strong&gt; (income statement, balance sheet, cash flow) stored as markdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database size&lt;/strong&gt;: ~3 GB in PostgreSQL + pgvector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post is the full story of how I got there. Every design decision, every dead end, every domain-specific surprise. If you're building RAG systems — especially over structured + unstructured data — I hope this saves you some pain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: The Naive Approach (and Why It Failed)
&lt;/h2&gt;

&lt;p&gt;Like most people building RAG systems in 2024, I started with the obvious architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download SEC filings&lt;/li&gt;
&lt;li&gt;Chunk the text&lt;/li&gt;
&lt;li&gt;Generate embeddings&lt;/li&gt;
&lt;li&gt;Store in a vector database&lt;/li&gt;
&lt;li&gt;On query: embed the question, find similar chunks, feed to LLM, get answer&lt;/li&gt;
&lt;/ol&gt;
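&lt;p&gt;Those five steps fit in a dozen lines. This sketch is illustrative only: &lt;code&gt;embed&lt;/code&gt;, &lt;code&gt;vector_store&lt;/code&gt;, and &lt;code&gt;llm&lt;/code&gt; are hypothetical stand-ins for whatever embedding model, vector store, and LLM you plug in.&lt;/p&gt;

```python
# Minimal "hello world" RAG loop. embed(), vector_store, and llm are
# hypothetical stand-ins, not names from the actual system.
def naive_rag_answer(question, vector_store, embed, llm, k=5):
    query_vec = embed(question)                       # 1. embed the question
    chunks = vector_store.search(query_vec, top_k=k)  # 2. nearest-neighbor chunks
    context = "\n\n".join(c.text for c in chunks)     # 3. stuff them into the prompt
    prompt = f"Answer from the context only.\n\nContext:\n{context}\n\nQ: {question}"
    return llm(prompt)                                # 4. hope it doesn't hallucinate
```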

&lt;p&gt;This is the "hello world" of RAG. Every tutorial teaches this pattern. And for some questions, it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What are Meta's key risk factors?"&lt;/strong&gt; — vector search finds the Risk Factors section, pulls relevant paragraphs, the LLM summarizes them. Good enough.&lt;/p&gt;

&lt;p&gt;Then someone asks: &lt;strong&gt;"What was Apple's revenue in 2023?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system pulls a paragraph from the MD&amp;amp;A section that &lt;em&gt;mentions&lt;/em&gt; revenue in passing: "...our total net revenue was $383.3 billion, a decrease of 3 percent year-over-year..." The LLM extracts the number. Sometimes it gets it right. Sometimes it grabs a quarterly number instead of annual. Sometimes it hallucinates entirely.&lt;/p&gt;

&lt;p&gt;The problem is fundamental: &lt;strong&gt;I'm using semantic similarity to find an exact number.&lt;/strong&gt; That's like using Google Maps to find your car keys. The tool isn't wrong — it's just the wrong tool for the job.&lt;/p&gt;

&lt;p&gt;The actual revenue figure was sitting in structured XBRL data the entire time — a machine-readable tag that says &lt;code&gt;us-gaap:Revenues = 383,285,000,000 USD&lt;/code&gt;. No ambiguity. No extraction needed. I just wasn't using it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Aha Moment
&lt;/h3&gt;

&lt;p&gt;SEC filings contain &lt;strong&gt;two fundamentally different types of information&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;How you should retrieve it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Structured data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Revenue, net income, EPS, total assets&lt;/td&gt;
&lt;td&gt;SQL query over parsed XBRL facts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unstructured narrative&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Risk factors, management discussion, business strategy&lt;/td&gt;
&lt;td&gt;Vector similarity search + reranking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And many real questions need &lt;strong&gt;both at once&lt;/strong&gt;: "Compare AAPL vs MSFT revenue and explain what drove the difference" needs the revenue numbers from XBRL &lt;em&gt;and&lt;/em&gt; the management commentary from MD&amp;amp;A.&lt;/p&gt;

&lt;p&gt;One retrieval pipeline can't do all of this well. That realization reshaped everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Building the Data Foundation
&lt;/h2&gt;

&lt;p&gt;Before any RAG can work, you need data. And SEC filing data is... a journey.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingesting from SEC EDGAR
&lt;/h3&gt;

&lt;p&gt;SEC EDGAR is the authoritative source for all public company filings in the US. It's also a product of the early 2000s web. The data quality issues I encountered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent filing formats&lt;/strong&gt;: Filings span decades of format changes. Older filings are HTML with embedded formatting. Newer ones use Inline XBRL. There's no single parsing strategy that works across all years.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing XBRL tags&lt;/strong&gt;: Some filings have incomplete XBRL tagging. A company might report revenue in their financial statements but not tag it properly, meaning programmatic extraction fails silently. You don't find out until a user asks a question and gets nothing back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt;: SEC EDGAR enforces strict rate limits (10 requests/second with a declared User-Agent). I use a 0.15-second delay between API calls, which means ingesting 15+ years of filings across 10 tickers takes hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filing amendments&lt;/strong&gt;: Companies file amendments (10-K/A, 10-Q/A) that supersede original filings. Without deduplication by accession number, you get double-counted data.&lt;/li&gt;
&lt;/ul&gt;
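&lt;p&gt;A minimal sketch of two of the safeguards above, assuming a hypothetical &lt;code&gt;fetch_fn&lt;/code&gt; for the HTTP call: a fixed delay between requests and dedup by accession number. (Handling 10-K/A amendments properly also means preferring the latest filing per period, which this sketch omits.)&lt;/p&gt;

```python
import time

# Illustrative ingestion loop: polite rate limiting plus accession-number
# dedup. fetch_fn and the filing dict shape are assumptions, not the
# actual system's API.
class EdgarIngestor:
    def __init__(self, fetch_fn, delay=0.15):
        self.fetch = fetch_fn
        self.delay = delay            # 0.15s keeps us well under 10 req/s
        self.seen_accessions = set()

    def ingest(self, filings):
        kept = []
        for filing in filings:
            acc = filing["accession_number"]
            if acc in self.seen_accessions:
                continue              # duplicate filing: skip it
            self.seen_accessions.add(acc)
            time.sleep(self.delay)    # respect SEC EDGAR rate limits
            kept.append(self.fetch(filing))
        return kept
```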

&lt;h3&gt;
  
  
  The Custom Parser Disaster
&lt;/h3&gt;

&lt;p&gt;My first attempt was a custom XBRL parser. It read SEC EDGAR's JSON endpoints and extracted financial facts. It worked for simple filings but quickly fell apart:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The period classification problem&lt;/strong&gt;: A single 10-K filing contains annual facts, quarterly comparatives, and year-to-date (YTD) aggregations, all interleaved. My parser couldn't reliably sort these into the correct buckets. A "FY2023 revenue" query might return the YTD-9-month figure instead of the full-year figure. Same concept, same filing, different time period — and the wrong one.&lt;/p&gt;

&lt;p&gt;I solved this with duration-based classification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_fact_period&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;duration_days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;duration_days&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="c1"&gt;# ~3 months → quarterly
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quarterly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;infer_quarter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="mi"&gt;330&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;duration_days&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;420&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="c1"&gt;# ~12 months → annual
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;annual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                                &lt;span class="c1"&gt;# 6-month, 9-month YTD → discard
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;                &lt;span class="c1"&gt;# Deliberately excluded
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Facts with durations between 120 and 330 days (YTD periods) are dropped entirely. This single filter eliminated an entire class of silent data corruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The missing data problem&lt;/strong&gt;: The parser would silently drop facts it couldn't classify, leading to gaps that only surfaced when users asked questions.&lt;/p&gt;

&lt;p&gt;I eventually switched to &lt;a href="https://github.com/dgunning/edgartools" rel="noopener noreferrer"&gt;edgartools&lt;/a&gt;, an open-source library built specifically for SEC EDGAR. It handled period classification, quarterly/annual bucketing, and section extraction out of the box. Not a silver bullet — it still struggled with certain filings — but it cut my ingestion bugs by 80%.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Database Decision
&lt;/h3&gt;

&lt;p&gt;I needed a database that could handle both structured financial data (XBRL facts) and vector embeddings for narrative search. The options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone + PostgreSQL&lt;/td&gt;
&lt;td&gt;Best-in-class vector search&lt;/td&gt;
&lt;td&gt;Two databases, cross-system joins, higher cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQLite + FAISS&lt;/td&gt;
&lt;td&gt;Simple, local&lt;/td&gt;
&lt;td&gt;No concurrent access, no full-text search, hard to deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL + pgvector&lt;/td&gt;
&lt;td&gt;Single database for everything&lt;/td&gt;
&lt;td&gt;Vector search not as fast as dedicated solutions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I went with &lt;strong&gt;PostgreSQL + pgvector&lt;/strong&gt;. The deciding factor: ~60% of queries in my system are relational (XBRL lookups, timeseries), not vector search. Running two databases for the 40% that needs embeddings wasn't worth the operational complexity.&lt;/p&gt;

&lt;p&gt;Plus, at ~134K embeddings, pgvector's IVFFlat indexes are perfectly adequate. This isn't a million-document corpus — it's a focused, well-structured dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema Design: 8 Tables, Each with a Purpose
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filings              → Central metadata (accession numbers, dates)
annual_facts         → XBRL from 10-K filings (~361K rows)
quarterly_facts      → XBRL from 10-Q filings (~734K rows)
sections_10k         → 10-K narrative chunks + embeddings (~42K rows)
sections_10q         → 10-Q narrative chunks + embeddings (~92K rows)
financial_documents  → Full statements as markdown (~1.8K rows)
filing_sections      → Raw extracted section text
earnings_reports     → 8-K structured earnings data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why separate annual and quarterly tables?&lt;/strong&gt; Performance. &lt;code&gt;annual_facts&lt;/code&gt; has 361K rows and &lt;code&gt;quarterly_facts&lt;/code&gt; has 734K rows. Annual queries (the majority) would waste time scanning quarterly rows behind a &lt;code&gt;WHERE fiscal_quarter IS NULL&lt;/code&gt; filter. Separate tables with dedicated indexes give cleaner query plans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why separate 10-K and 10-Q vector tables?&lt;/strong&gt; The section structures are fundamentally different (10-K has Items 1-15, 10-Q has Parts I-II with different items), they have different temporal columns, and different uniqueness constraints.&lt;/p&gt;

&lt;p&gt;One design choice I'm particularly happy with: &lt;strong&gt;partial indexes for consolidated data&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_annual_consolidated&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;annual_facts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concept&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fiscal_year&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;XBRL reports the same metric multiple times — consolidated total, by geographic segment, by product line. Without filtering, a revenue query returns 5-10 values instead of one. This partial index enforces "consolidated only" at the index level and enables index-only scans for the most common query pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3: The Domain-Specific Nightmares
&lt;/h2&gt;

&lt;p&gt;This is where the project got really interesting. And by "interesting," I mean "spent weeks debugging things that no tutorial prepared me for."&lt;/p&gt;

&lt;h3&gt;
  
  
  Nightmare #1: Fiscal Years Don't Follow the Calendar
&lt;/h3&gt;

&lt;p&gt;Only 4 of my 10 companies use a December fiscal year-end. The others:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Ticker&lt;/th&gt;
&lt;th&gt;FY End&lt;/th&gt;
&lt;th&gt;What "2024" Actually Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AAPL&lt;/td&gt;
&lt;td&gt;September&lt;/td&gt;
&lt;td&gt;Oct 2023 – Sep 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MSFT&lt;/td&gt;
&lt;td&gt;June&lt;/td&gt;
&lt;td&gt;Jul 2023 – Jun 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVDA&lt;/td&gt;
&lt;td&gt;January&lt;/td&gt;
&lt;td&gt;Feb 2024 – Jan 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AVGO&lt;/td&gt;
&lt;td&gt;October&lt;/td&gt;
&lt;td&gt;Nov 2023 – Oct 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When a user asks "What was NVIDIA's revenue in 2024?", they probably mean calendar year 2024. But in NVIDIA's filing terminology, that's FY2025 (ending January 2025). If you query &lt;code&gt;WHERE fiscal_year = 2024&lt;/code&gt;, you get FY2024 — which is mostly calendar year 2023.&lt;/p&gt;

&lt;p&gt;I built a &lt;code&gt;calendar_to_fiscal_year()&lt;/code&gt; mapping, but the tricky part is that the LLM classifier doesn't always know about these mappings. The system prompt tells it about fiscal year conventions, but the real safeguard is the programmatic mapping that runs regardless of what the LLM outputs.&lt;/p&gt;

&lt;p&gt;This took me longer to debug than I'd like to admit. I kept getting NVIDIA numbers that were "off by one year" and couldn't figure out why until I sat down with a calendar and traced through the fiscal year logic manually.&lt;/p&gt;
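&lt;p&gt;A minimal sketch of what &lt;code&gt;calendar_to_fiscal_year()&lt;/code&gt; has to do. The FY-end months come from the table above; the "shift by one if the fiscal year ends in the first half" rule is my simplification, not the system's actual logic:&lt;/p&gt;

```python
# FY-end months from the table above. December is the default for
# calendar-year filers.
FISCAL_YEAR_END_MONTH = {
    "AAPL": 9,   # September
    "MSFT": 6,   # June
    "NVDA": 1,   # January
    "AVGO": 10,  # October
}

def calendar_to_fiscal_year(ticker, calendar_year):
    """Map the calendar year a user means to the company's FY label."""
    fye_month = FISCAL_YEAR_END_MONTH.get(ticker, 12)
    # If the FY ends early in the calendar year, most of "FY N" is
    # calendar year N-1: NVDA's FY2025 runs Feb 2024 to Jan 2025, so a
    # question about calendar 2024 maps to FY2025. The June cutoff is
    # an assumption for illustration.
    if fye_month in (1, 2, 3, 4, 5, 6):
        return calendar_year + 1
    return calendar_year
```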

&lt;h3&gt;
  
  
  Nightmare #2: 52/53-Week Fiscal Calendars
&lt;/h3&gt;

&lt;p&gt;Apple and Broadcom don't use exact month-end dates. They use 52/53-week fiscal calendars, which means their quarter endings can spill 1-7 days into the next calendar month.&lt;/p&gt;

&lt;p&gt;Apple's Q1 FY2024 ended on December 30, 2023 — not December 31. But their Q4 FY2023 ended on September 30, 2023. And sometimes a quarter ends on October 1 instead of September 30.&lt;/p&gt;

&lt;p&gt;This created quarter-misclassification bugs: a period ending January 2 would be classified as Q2 instead of Q1 because the raw month was January.&lt;/p&gt;

&lt;p&gt;The fix is a humble 2-line heuristic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infer_fiscal_quarter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filing&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;period&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;period_of_report&lt;/span&gt;
    &lt;span class="n"&gt;month&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;period&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
    &lt;span class="c1"&gt;# Handle 52/53-week calendars: if period ends in first 7 days,
&lt;/span&gt;    &lt;span class="c1"&gt;# roll back to previous month
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;period&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;month&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;month&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;month&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This correctly handles the spillover that affects ~5% of Apple's and Broadcom's filings. Two lines of code, weeks of debugging to discover I needed them.&lt;/p&gt;
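&lt;p&gt;For completeness, here is a self-contained version of the heuristic, including the month-to-quarter step the snippet above omits. The offset arithmetic is my reconstruction of the idea, not the system's actual code:&lt;/p&gt;

```python
from datetime import date

# FY-end months from the fiscal-year table earlier.
FY_END_MONTH = {"AAPL": 9, "MSFT": 6, "NVDA": 1, "AVGO": 10, "AMZN": 12}

def infer_fiscal_quarter(ticker, period_end):
    month = period_end.month
    # 52/53-week calendars: a period ending in the first 7 days of a
    # month really belongs to the previous month.
    if period_end.day in range(1, 8):
        month = 12 if month == 1 else month - 1
    fye_month = FY_END_MONTH.get(ticker, 12)
    # Whole months since the fiscal year began, bucketed into quarters.
    months_into_fy = (month - fye_month - 1) % 12
    return months_into_fy // 3 + 1
```

&lt;p&gt;With the rollback, a period ending January 2 correctly lands in Apple's fiscal Q1; without it, raw month arithmetic puts it in Q2.&lt;/p&gt;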

&lt;h3&gt;
  
  
  Nightmare #3: Q4 Doesn't Exist
&lt;/h3&gt;

&lt;p&gt;This one genuinely surprised me. The SEC does not require companies to file a separate 10-Q for Q4. Q4 data is embedded in the annual 10-K filing. So if you query &lt;code&gt;quarterly_facts WHERE fiscal_quarter = 4&lt;/code&gt;, you often get nothing.&lt;/p&gt;

&lt;p&gt;I needed three complementary approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Arithmetic derivation&lt;/strong&gt;: &lt;code&gt;Q4 = Annual - Q1 - Q2 - Q3&lt;/code&gt;. Simple, but requires all three quarterly filings to be present.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Income statement vs. balance sheet distinction&lt;/strong&gt;: This is the subtle one. Revenue is a &lt;em&gt;flow&lt;/em&gt; metric — it accumulates over time, so Q4 = Annual minus the first three quarters. But total assets is a &lt;em&gt;stock&lt;/em&gt; metric — it's a point-in-time snapshot. For balance sheet items, Q4 = the annual period-end value, not a subtraction.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Income Statement (flow): Q4 = Annual - Q1 - Q2 - Q3
# Balance Sheet (stock):   Q4 = Annual value (point-in-time)
# Cash Flow (flow):        Q4 = Annual - Q3 YTD
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;edgartools TTM calculator&lt;/strong&gt;: For edge cases where the arithmetic approach fails (missing quarters), the library has built-in quarterization logic.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I didn't find any of this in a tutorial. I found it by getting wrong Q4 numbers and tracing through the data until I understood why.&lt;/p&gt;
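&lt;p&gt;The first two approaches reduce to a few lines once each concept is labeled by statement type. The &lt;code&gt;statement_type&lt;/code&gt; parameter here is a hypothetical label for illustration, not the system's schema:&lt;/p&gt;

```python
# Q4 derivation for the two metric kinds described above.
def derive_q4(annual_value, q1, q2, q3, statement_type):
    if statement_type == "balance_sheet":
        # Stock metric: the annual figure IS the Q4 point-in-time value.
        return annual_value
    # Flow metric: Q4 is whatever the first three quarters don't explain.
    if None in (q1, q2, q3):
        return None  # missing a quarter: fall back to TTM quarterization
    return annual_value - q1 - q2 - q3
```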

&lt;h3&gt;
  
  
  Nightmare #4: XBRL Tags Get Renamed
&lt;/h3&gt;

&lt;p&gt;The XBRL taxonomy evolves. What was &lt;code&gt;us-gaap:SalesRevenueNet&lt;/code&gt; in 2015 became &lt;code&gt;us-gaap:Revenues&lt;/code&gt; in 2018 and then &lt;code&gt;us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax&lt;/code&gt; (yes, that's one tag name) after ASC 606 adoption.&lt;/p&gt;

&lt;p&gt;A timeseries query for "Apple's revenue 2015-2024" must search across &lt;strong&gt;all three&lt;/strong&gt; concept names. If you only search for &lt;code&gt;us-gaap:Revenues&lt;/code&gt;, you'll get data for 2018-2020 but miss everything before and after.&lt;/p&gt;

&lt;p&gt;My solution is a concept alias system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CONCEPT_ALIASES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-gaap:Revenues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-gaap:SalesRevenueNet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-gaap:SalesRevenueGoodsNet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The retrieval functions try each alias in order and take the first match per (ticker, year) pair. The &lt;code&gt;get_metric_timeseries()&lt;/code&gt; function uses &lt;code&gt;DISTINCT ON (fiscal_year, fiscal_quarter)&lt;/code&gt; with ordering by &lt;code&gt;end_date DESC&lt;/code&gt; to pick the most relevant value when multiple concepts match.&lt;/p&gt;
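&lt;p&gt;In-memory, the alias fallback looks like this. The &lt;code&gt;facts&lt;/code&gt; dict stands in for the real SQL query:&lt;/p&gt;

```python
# Try each known concept name in priority order; take the first match
# for this (ticker, fiscal_year). Alias list from the article above.
REVENUE_ALIASES = [
    "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
    "us-gaap:Revenues",
    "us-gaap:SalesRevenueNet",
    "us-gaap:SalesRevenueGoodsNet",
]

def lookup_first_alias(facts, ticker, fiscal_year, aliases):
    for concept in aliases:
        value = facts.get((ticker, concept, fiscal_year))
        if value is not None:
            return concept, value
    return None, None
```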

&lt;p&gt;But here's the really sneaky part — I discovered this bug just recently. The LLM classifier would return &lt;code&gt;xbrl_concepts: ["us-gaap:Revenues"]&lt;/code&gt; for a multi-company query. NVIDIA uses &lt;code&gt;us-gaap:Revenues&lt;/code&gt;, so it worked fine. But Microsoft uses &lt;code&gt;us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax&lt;/code&gt;. The single concept from the classifier found NVIDIA's data but missed Microsoft's entirely, causing Microsoft to fall back to raw income statement text — which the LLM then misinterpreted, producing wrong numbers.&lt;/p&gt;

&lt;p&gt;The fix was to expand the classifier's single concept with the full alias list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After resolving the classifier's concept, expand with all known aliases
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resolved&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;CONCEPT_ALIASES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resolved&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;alias&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;CONCEPT_ALIASES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;alias&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;resolved&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was a production bug that produced confident, wrong answers. The kind that erodes trust. And it came from the interaction between two correct components — the classifier was right to suggest &lt;code&gt;us-gaap:Revenues&lt;/code&gt;, and the retrieval was right to use whatever concept it was given. The bug was in the assumption that one concept name works across all companies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nightmare #5: Consolidated vs. Segment Data
&lt;/h3&gt;

&lt;p&gt;XBRL filings report the same metric multiple times with different dimensional breakdowns. Apple's 10-K contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total revenue: $383B (consolidated)&lt;/li&gt;
&lt;li&gt;iPhone revenue: $200B (product segment)&lt;/li&gt;
&lt;li&gt;Services revenue: $85B (product segment)&lt;/li&gt;
&lt;li&gt;Americas revenue: $162B (geographic segment)&lt;/li&gt;
&lt;li&gt;Europe revenue: $94B (geographic segment)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All tagged as "revenue" variants in XBRL. Without filtering, a revenue query returns all of these. The LLM then has to figure out which one is the "real" revenue, and it often picks wrong.&lt;/p&gt;

&lt;p&gt;The default behavior filters on &lt;code&gt;WHERE dimension IS NULL&lt;/code&gt; (consolidated only). Segment data is opt-in via an &lt;code&gt;include_segments=True&lt;/code&gt; flag, triggered when the query contains keywords like "by segment," "by product," "by region."&lt;/p&gt;
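&lt;p&gt;The opt-in logic is a simple keyword gate. The keyword list here is illustrative, not the system's exact one:&lt;/p&gt;

```python
# Segment data is opt-in: include dimensional rows only when the
# question asks for a breakdown. Keyword list is illustrative.
SEGMENT_KEYWORDS = ("by segment", "by product", "by region", "by geography")

def wants_segments(question):
    q = question.lower()
    return any(kw in q for kw in SEGMENT_KEYWORDS)

def dimension_filter(question):
    if wants_segments(question):
        return ""  # no filter: segment breakdowns included
    return "AND dimension IS NULL"  # consolidated totals only
```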




&lt;h2&gt;
  
  
  Phase 4: The 5-Route Retrieval Architecture
&lt;/h2&gt;

&lt;p&gt;After hitting all these domain problems, the architecture crystallized into something I couldn't have designed upfront. Each route exists because a specific class of questions couldn't be served well by any other route.&lt;/p&gt;

&lt;h3&gt;
  
  
  Route 1: &lt;code&gt;metric_lookup&lt;/code&gt; — Direct XBRL Queries
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For&lt;/strong&gt;: "What was Apple's revenue in 2023?"&lt;/p&gt;

&lt;p&gt;This is the workhorse. Direct SQL query against &lt;code&gt;annual_facts&lt;/code&gt; or &lt;code&gt;quarterly_facts&lt;/code&gt;. No embeddings, no LLM in the retrieval step — just a database lookup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify → Extract ticker, year, concept
         → SQL: SELECT value FROM annual_facts
                WHERE ticker='AAPL' AND concept='us-gaap:Revenues'
                AND fiscal_year=2023 AND dimension IS NULL
         → Compute YoY comparison
         → Format and generate answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response time: sub-100ms for retrieval. Total with LLM: ~3 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Route 2: &lt;code&gt;timeseries&lt;/code&gt; — Multi-Year Trends
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For&lt;/strong&gt;: "Show NVIDIA revenue from 2020 to 2024"&lt;/p&gt;

&lt;p&gt;Uses &lt;code&gt;get_metric_timeseries()&lt;/code&gt; which handles concept alias merging across years. Returns all data points chronologically, with derived Q4 values where needed.&lt;/p&gt;
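<p>Alias merging is the tricky part: the same economic metric can be tagged <code>us-gaap:Revenues</code> one year and <code>us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax</code> the next. A sketch of the merge, with an illustrative row shape rather than the project's actual one:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>REVENUE_ALIASES = (
    "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
    "us-gaap:Revenues",
    "us-gaap:SalesRevenueNet",
)

def merge_aliases(rows, aliases=REVENUE_ALIASES):
    # rows: dicts like {"fiscal_year": 2023, "concept": "...", "value": 383.29}
    # For each year, keep the value from the highest-priority alias that
    # actually appears, so a taxonomy change mid-series leaves no gap.
    by_year = {}
    for row in rows:
        year = row["fiscal_year"]
        rank = aliases.index(row["concept"]) if row["concept"] in aliases else len(aliases)
        if year not in by_year or rank &lt; by_year[year][0]:
            by_year[year] = (rank, row["value"])
    return sorted((year, value) for year, (_, value) in by_year.items())
</code></pre>

</div>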

&lt;h3&gt;
  
  
  Route 3: &lt;code&gt;full_statement&lt;/code&gt; — Complete Financial Tables
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For&lt;/strong&gt;: "Show me JPMorgan's balance sheet for 2023"&lt;/p&gt;

&lt;p&gt;Fetches pre-formatted markdown tables from &lt;code&gt;financial_documents&lt;/code&gt;. The tables are rendered once at ingestion time and stored as-is; re-rendering a financial table on every query would be wasted work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Route 4: &lt;code&gt;narrative&lt;/code&gt; — Semantic Search
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For&lt;/strong&gt;: "What are Meta's key risk factors?"&lt;/p&gt;

&lt;p&gt;This is the classic RAG pipeline: embed the query, search pgvector, rerank with a cross-encoder, return the top chunks.&lt;/p&gt;

&lt;p&gt;But with important enhancements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Section-aware chunking&lt;/strong&gt;: Instead of naive 512-token chunks, the system detects subsection headings within the Risk Factors and MD&amp;amp;A sections, treating each subsection as a chunk candidate. Split points snap to sentence boundaries, never mid-sentence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-encoder reranking&lt;/strong&gt;: The initial pgvector search returns ~100 candidates. These are reranked using &lt;code&gt;cross-encoder/ms-marco-MiniLM-L-6-v2&lt;/code&gt;, which scores each (query, chunk) pair directly. This is critical for financial text — a query about "revenue growth drivers" has high cosine similarity with "revenue recognition policies" (shares the word "revenue") but low actual relevance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk-specific sub-queries&lt;/strong&gt;: For risk analysis queries, the system generates additional search queries targeting specific risk categories (litigation, debt, competition, cybersecurity) to cast a wider net.&lt;/li&gt;
&lt;/ul&gt;
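
<p>The sentence-boundary snapping fits in a few lines. A simplified sketch that skips heading detection and shows only the packing step (a character budget here stands in for the real token budget):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>import re

def chunk_section(text, max_chars=2000):
    # Split into sentences, then pack sentences greedily into chunks
    # so a split never lands mid-sentence.
    sentences = re.split(r"(?&lt;=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 &gt; max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
</code></pre>

</div>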

&lt;h3&gt;
  
  
  Route 5: &lt;code&gt;hybrid&lt;/code&gt; — Structured + Narrative
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For&lt;/strong&gt;: "Compare AAPL vs MSFT revenue and explain what drove the difference"&lt;/p&gt;

&lt;p&gt;This is the most complex route. It combines relational retrieval (XBRL numbers) with narrative retrieval (MD&amp;amp;A text) and presents both to the LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify → Determine retrieval_intent
         → If "timeseries": retrieve_timeseries()
         → If "full_statement": retrieve_full_statement()
         → Else: retrieve_metric_lookup()
         → ALSO: retrieve_narrative() for context
         → Merge relational + vector results
         → Format with separate "Structured Data" and "Narrative" sections
         → Generate answer with instructions to use numbers from XBRL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: the LLM is explicitly told to use XBRL data for hard numbers and narrative sections for qualitative context. Without this instruction, the LLM would sometimes extract numbers from narrative paragraphs instead of the authoritative XBRL data — producing slightly different values due to rounding in the prose.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Query Classifier
&lt;/h3&gt;

&lt;p&gt;Routing is done by GPT-4o-mini with function calling. The function schema extracts 15+ fields from a natural language query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route (metric_lookup, timeseries, full_statement, narrative, hybrid)&lt;/li&gt;
&lt;li&gt;Tickers&lt;/li&gt;
&lt;li&gt;Fiscal year(s)&lt;/li&gt;
&lt;li&gt;XBRL concepts&lt;/li&gt;
&lt;li&gt;Retrieval intent&lt;/li&gt;
&lt;li&gt;Temporal granularity (annual/quarterly/specific_quarter)&lt;/li&gt;
&lt;li&gt;Target sections&lt;/li&gt;
&lt;li&gt;Search queries for vector retrieval&lt;/li&gt;
&lt;/ul&gt;
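
<p>In the OpenAI tools format, a trimmed version of such a schema looks roughly like this (field names are illustrative, and the real schema extracts more fields than shown):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>CLASSIFY_TOOL = {
    "type": "function",
    "function": {
        "name": "classify_query",
        "description": "Classify a financial question and extract retrieval metadata.",
        "parameters": {
            "type": "object",
            "properties": {
                "route": {
                    "type": "string",
                    "enum": ["metric_lookup", "timeseries", "full_statement",
                             "narrative", "hybrid"],
                },
                "tickers": {"type": "array", "items": {"type": "string"}},
                "fiscal_years": {"type": "array", "items": {"type": "integer"}},
                "xbrl_concepts": {"type": "array", "items": {"type": "string"}},
                "retrieval_intent": {"type": "string"},
            },
            "required": ["route", "tickers"],
        },
    },
}
</code></pre>

</div>

<p>Because every field is typed, a response either validates or the call is retried; there is no JSON-repair step.</p>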

&lt;p&gt;&lt;strong&gt;Why function calling instead of prompt-based classification?&lt;/strong&gt; Structured output. A prompt-based approach ("respond with JSON") frequently produced malformed output for complex queries. Function calling returns typed fields — no parsing failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why GPT-4o-mini instead of GPT-4o?&lt;/strong&gt; Classification is pattern matching, not deep reasoning. GPT-4o-mini handles it at 10x lower cost with comparable accuracy. The savings compound — classification runs on every query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Post-Classification Safety Nets
&lt;/h3&gt;

&lt;p&gt;The classifier is right ~95% of the time. The other 5% cause the worst user experiences. Two heuristics catch the common misroutes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Risk queries misrouted to metric_lookup&lt;/strong&gt;: When the LLM sees "Apple" and defaults to financial data, even though the query is about risk factors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric_lookup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeseries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
   &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\brisk factors?\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
   &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xbrl_concepts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;route&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;narrative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Quantitative queries misrouted to narrative&lt;/strong&gt;: When the LLM routes to narrative but its own metadata includes XBRL concepts and a quantitative intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;narrative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
   &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xbrl_concepts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
   &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specific_metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comparison&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;route&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric_lookup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are band-aids. The LLM should get these right. But in practice, having deterministic safety nets for known failure patterns is cheaper than perfect classification.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 5: Making Answers Trustworthy
&lt;/h2&gt;

&lt;p&gt;Getting the right data is half the battle. The other half is making the LLM produce accurate, verifiable answers from that data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Formatting: Route-Specific
&lt;/h3&gt;

&lt;p&gt;The context string sent to the LLM is formatted differently for each route:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relational routes&lt;/strong&gt; get structured, labeled data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## XBRL Financial Facts
- **us-gaap:Revenues** (2023): 383.29 billion USD
  [XBRL: us-gaap:Revenues | AAPL, FY 2023]

## Year-over-Year Comparison
- AAPL revenue FY2022→FY2023: 394.33B → 383.29B (-2.8%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Narrative routes&lt;/strong&gt; get section chunks with metadata headers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### Chunk 1 [10-K] META FY2025 - Risk Factors (score: 1.780)
[Source: 10-K, Ticker: META, Year: 2025, Section: Risk Factors]
Certain factors may have a material adverse effect on our business...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Hybrid routes&lt;/strong&gt; get both, separated by headers. The system prompt explicitly says: "Use XBRL data for top-line metrics. Use narrative sections for qualitative breakdowns."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Financial Snapshot
&lt;/h3&gt;

&lt;p&gt;For risk analysis and comparative queries, a "financial snapshot" is prepended to the context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## META FY2025 — Financial Snapshot
- Revenue: $200.97B
- Net Income: $60.46B
- Total Assets: $366.02B
- Long-Term Debt: $58.74B
- Cash &amp;amp; Equivalents: $35.87B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives the LLM quantitative grounding when discussing risks. Without it, the LLM produces generic risk factor summaries. With it, the LLM can say "Meta's $58.74B in long-term debt creates refinancing risk..." — much more useful.&lt;/p&gt;

&lt;p&gt;A fun bug I discovered here: the snapshot was accidentally pulling from the &lt;code&gt;quarterly_facts&lt;/code&gt; table (default) instead of &lt;code&gt;annual_facts&lt;/code&gt;. So NVIDIA's revenue showed as $2.22B (Q1 value) instead of $10.92B (annual total). The LLM confidently used the wrong number because it was labeled "Revenue" in the context. Changed one parameter — &lt;code&gt;table="annual_facts"&lt;/code&gt; — and the numbers were correct. One parameter. Hours of debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contradiction Detection
&lt;/h3&gt;

&lt;p&gt;This is one of my favorite features. The system cross-references narrative claims against XBRL data to catch inconsistencies.&lt;/p&gt;

&lt;p&gt;If a narrative chunk says "revenue increased" but the XBRL YoY comparison shows a decrease greater than 2%, this is flagged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;narrative_direction&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;increase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;data_direction&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decrease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;contradictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;direction_mismatch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pct_change&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Narrative says revenue increased, but data shows &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pct_change&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also catches magnitude discrepancies — when the narrative claims "grew approximately 20%" but the actual XBRL data shows 5.3% growth.&lt;/p&gt;
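<p>The magnitude check follows the same shape as the direction check. A sketch using a percentage-point tolerance like the configured <code>magnitude_tolerance_ppt</code> (the function name and severity cutoff are my assumptions, not the project's code):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>def check_magnitude(claimed_pct, actual_pct, tolerance_ppt=5.0):
    # Flag a contradiction when the narrative's claimed growth differs from
    # the XBRL-derived percentage by more than the tolerance (in percentage
    # points); twice the tolerance escalates the severity.
    gap = abs(claimed_pct - actual_pct)
    if gap &lt;= tolerance_ppt:
        return None
    return {
        "type": "magnitude_mismatch",
        "severity": "high" if gap &gt; 2 * tolerance_ppt else "medium",
        "detail": f"Narrative claims {claimed_pct:+.1f}%, data shows {actual_pct:+.1f}%",
    }
</code></pre>

</div>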

&lt;p&gt;&lt;strong&gt;Why does this matter?&lt;/strong&gt; SEC filing narratives are written by management. They can contain optimistic framing. "Revenue remained strong" might accompany a 3% decline. Cross-referencing against hard numbers catches this.&lt;/p&gt;

&lt;p&gt;All the keywords, thresholds, and watched concepts are configured in &lt;code&gt;guardrails.yaml&lt;/code&gt; — no hardcoded values. The &lt;code&gt;direction_tolerance_pct&lt;/code&gt; is 2% (small changes aren't contradictions), and the &lt;code&gt;magnitude_tolerance_ppt&lt;/code&gt; is 5 percentage points (narrative approximations get some slack).&lt;/p&gt;

&lt;h3&gt;
  
  
  Confidence Scoring
&lt;/h3&gt;

&lt;p&gt;Every answer gets a confidence score (0-100) from 5 weighted signals:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval Quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Source reliability (XBRL=90, statements=72, narrative=varies by rerank score)&lt;/td&gt;
&lt;td&gt;25-40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Source Coverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What fraction of requested (ticker, year, concept) tuples had data&lt;/td&gt;
&lt;td&gt;20-25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-Source Agreement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Contradiction penalties (-25 per high severity, -15 per medium)&lt;/td&gt;
&lt;td&gt;15-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Citation Density&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Citations per sentence vs. target (0.5)&lt;/td&gt;
&lt;td&gt;10-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Recency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Staleness penalty (-20 per year gap between requested and available data)&lt;/td&gt;
&lt;td&gt;10-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The weights are route-specific: relational routes weight retrieval quality at 40% (high confidence in XBRL data), narrative routes weight it at 25% (inherently less precise), timeseries routes weight source coverage at 35% (incomplete data is the primary failure mode).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tiers&lt;/strong&gt;: High (75+, green), Moderate (50-74, yellow), Low (25-49, red).&lt;/p&gt;

&lt;p&gt;Everything — weights, thresholds, tier boundaries, keywords — lives in &lt;code&gt;guardrails.yaml&lt;/code&gt;. This was one of the best architectural decisions. Tuning confidence scoring doesn't require code changes. Just edit the YAML.&lt;/p&gt;
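<p>The scoring itself is just a clamped weighted average. A sketch, assuming signals and weights arrive as dicts (renormalizing over present signals is my assumption about how missing signals are handled):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>def confidence_score(signals, weights):
    # signals: name to 0-100 value; weights: name to fraction (from YAML).
    # Renormalize over the signals that are present so a missing signal
    # doesn't silently drag the score down.
    total_weight = sum(weights[name] for name in signals)
    if total_weight == 0:
        return 0.0
    score = sum(signals[name] * weights[name] for name in signals) / total_weight
    return max(0.0, min(100.0, score))

def tier(score):
    # Tier boundaries from the table above.
    if score &gt;= 75:
        return "high"
    if score &gt;= 50:
        return "moderate"
    return "low"
</code></pre>

</div>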

&lt;h3&gt;
  
  
  Citation System
&lt;/h3&gt;

&lt;p&gt;Every factual claim in the answer includes an inline citation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Source: 10-K, Ticker: AAPL, Year: 2024, Section: Income Statement]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Post-generation, the API server enriches these with actual SEC EDGAR URLs by looking up accession numbers from the &lt;code&gt;filings&lt;/code&gt; table. The frontend renders clickable links directly to the filing on sec.gov.&lt;/p&gt;

&lt;p&gt;This sounds like a small detail, but it's the difference between "trust me" and "here's the source."&lt;/p&gt;
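<p>The enrichment step can be sketched as a regex pass over the answer. The pattern mirrors the citation format above; <code>lookup_accession</code> is an illustrative stand-in for the filings-table query, and the URL follows EDGAR's standard archive layout:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>import re

CITATION_RE = re.compile(
    r"\[Source: (?P&lt;form&gt;10-[KQ]), Ticker: (?P&lt;ticker&gt;[A-Z.-]+), "
    r"Year: (?P&lt;year&gt;\d{4})(?:, Section: [^\]]+)?\]"
)

def enrich_citations(answer, lookup_accession):
    # lookup_accession(ticker, year, form) returns (cik, accession_no);
    # both names are illustrative, not the project's actual API.
    def to_link(match):
        cik, accession = lookup_accession(
            match.group("ticker"), int(match.group("year")), match.group("form")
        )
        url = (
            "https://www.sec.gov/Archives/edgar/data/"
            f"{cik}/{accession.replace('-', '')}/{accession}-index.htm"
        )
        return f"{match.group(0)} ({url})"
    return CITATION_RE.sub(to_link, answer)
</code></pre>

</div>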




&lt;h2&gt;
  
  
  Phase 6: Multi-Company Queries (Harder Than You'd Think)
&lt;/h2&gt;

&lt;p&gt;Comparing two companies isn't "run the single-company pipeline twice." I learned this the hard way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reranking Scores Aren't Comparable Across Tickers
&lt;/h3&gt;

&lt;p&gt;When comparing Apple and Microsoft, narrative retrieval runs separate vector searches for each company. The cross-encoder reranking scores are query-dependent — Apple chunks are scored against the comparison query, and so are Microsoft chunks.&lt;/p&gt;

&lt;p&gt;But these scores are &lt;strong&gt;not comparable across companies&lt;/strong&gt;. One company's filing text might consistently score higher simply because of writing style or keyword density. Without intervention, Apple's chunks could dominate the results even if the user asked for an equal comparison.&lt;/p&gt;

&lt;p&gt;The fix: per-ticker fair allocation in guardrails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_multi_ticker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;per_ticker_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_chunks&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tickers&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ticker_chunks&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;group_by_ticker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;ticker_chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rerank_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;kept&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker_chunks&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;per_ticker_max&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each company gets an equal share of the chunk budget, sorted by rerank score within each group.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Decomposition
&lt;/h3&gt;

&lt;p&gt;Multi-ticker queries are decomposed into per-ticker sub-queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parent: "Compare AAPL and MSFT revenue growth 2020-2024"
  → Sub-query 1: "AAPL revenue data 2020-2024"
  → Sub-query 2: "MSFT revenue data 2020-2024"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sub-queries execute in parallel using &lt;code&gt;ThreadPoolExecutor(max_workers=4)&lt;/code&gt;. This was a significant performance improvement — sequential execution added 5-10 seconds of latency.&lt;/p&gt;
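<p>The fan-out is a standard executor pattern. A minimal sketch, with <code>retrieve</code> standing in for the route's retrieval function:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>from concurrent.futures import ThreadPoolExecutor

def run_subqueries(sub_queries, retrieve):
    # Execute per-ticker sub-queries in parallel. Futures are keyed by
    # sub-query, so results stay attributable to their ticker regardless
    # of completion order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {sq: pool.submit(retrieve, sq) for sq in sub_queries}
        return {sq: fut.result() for sq, fut in futures.items()}
</code></pre>

</div>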

&lt;h3&gt;
  
  
  Fiscal Year Differences in the Answer
&lt;/h3&gt;

&lt;p&gt;Apple's FY2023 ends in September 2023. Microsoft's FY2023 ends in June 2023. When the answer says "FY2023 revenue," these cover different calendar periods. The system prompt explicitly instructs the LLM to note fiscal year differences when comparing companies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 7: Caching, Cost, and Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three-Layer Cache
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;TTL&lt;/th&gt;
&lt;th&gt;What's Cached&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Query Results&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;Full answer + sources + confidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;td&gt;Route + extracted metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;Raw retrieval data per (route + tickers + years + concepts)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

<p>The classification cache gets a longer TTL because the correct classification for a given query string doesn't change, so stale entries are harmless. The retrieval cache keys are hashed from a composite of all retrieval parameters, so "Apple's revenue in 2024" and "AAPL revenue FY2024" — different query strings but identical retrieval parameters — share the same cache entry.</p>
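<p>A sketch of how such a composite key can be built: normalize parameter order, then hash (the key prefix and field names are illustrative):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>import hashlib
import json

def retrieval_cache_key(route, tickers, years, concepts):
    # Sort every parameter list so logically identical requests hash the
    # same way regardless of how the query phrased them.
    payload = json.dumps(
        {
            "route": route,
            "tickers": sorted(tickers),
            "years": sorted(years),
            "concepts": sorted(concepts),
        },
        sort_keys=True,
    )
    return "retrieval:" + hashlib.sha256(payload.encode()).hexdigest()
</code></pre>

</div>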

&lt;p&gt;Everything uses Redis with graceful degradation — if Redis is unavailable, the system works without caching. No crashes, just slower.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Tracking
&lt;/h3&gt;

&lt;p&gt;Every query tracks its OpenAI API cost across all phases. A typical query costs $0.0003-$0.0007 — well under a penny. The frontend displays a cost card with per-phase breakdown, total cost, token counts, and an efficiency grade (S through C).&lt;/p&gt;

&lt;p&gt;I added this for myself during development, but users love it. It builds trust — they can see that a simple metric lookup costs almost nothing, while a complex hybrid comparison costs a fraction of a cent more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment: Railway + Vercel
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vercel&lt;/strong&gt; (frontend): Free tier for React, global CDN, auto-deploy from GitHub&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Railway&lt;/strong&gt; (backend + database): $5/month Hobby plan for FastAPI + PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The database was the interesting challenge. Railway's managed PostgreSQL doesn't include pgvector. I deployed a custom Docker image (&lt;code&gt;pgvector/pgvector:pg17&lt;/code&gt;) as a Railway service. This required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom volume mount (&lt;code&gt;PGDATA=/data/postgres&lt;/code&gt;) because Railway's default creates a &lt;code&gt;lost+found&lt;/code&gt; that conflicts with PostgreSQL's data directory&lt;/li&gt;
&lt;li&gt;Manual TCP proxy for external access during migration&lt;/li&gt;
&lt;li&gt;Reducing IVFFlat index lists from 100 to 50 because Railway's 64MB &lt;code&gt;maintenance_work_mem&lt;/code&gt; wasn't enough for the larger index build&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data migration (3 GB) from local to Railway: &lt;code&gt;pg_dump&lt;/code&gt; → &lt;code&gt;pg_restore&lt;/code&gt;, with a version-matching headache (local pg17 vs system pg_dump v14) that took longer to debug than the actual migration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Domain complexity dominates technical complexity.&lt;/strong&gt; The hardest problems weren't about vector search or LLM prompting — they were about fiscal year conventions, XBRL taxonomy evolution, and Q4 derivation. A naive RAG system that ignores domain details produces confidently wrong answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structured data should stay structured.&lt;/strong&gt; I tried embedding financial tables early on. It was worse in every dimension: higher storage cost, slower retrieval, lower answer quality. Financial numbers should be queried relationally. Only narrative text benefits from semantic search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. LLM classification needs deterministic safety nets.&lt;/strong&gt; 95% accuracy sounds high, but the 5% of misrouted queries cause the worst user experiences. Rule-based overrides for known failure patterns are cheap insurance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Multi-entity queries are fundamentally different from single-entity.&lt;/strong&gt; Reranking scores aren't comparable across tickers. Chunk budgets need fair allocation. Answer prompts need fiscal year difference warnings. You can't just loop.&lt;/p&gt;

<p><strong>5. One parameter can ruin everything.</strong> Using <code>quarterly_facts</code> instead of <code>annual_facts</code> for the financial snapshot (a default parameter I didn't override) produced wrong numbers that the LLM confidently presented. The data was technically correct — it was a real Q1 revenue number. It just wasn't the <em>right</em> number for an annual snapshot.</p>

&lt;p&gt;&lt;strong&gt;6. Config-driven guardrails enable iteration.&lt;/strong&gt; Moving all thresholds, weights, and keywords into &lt;code&gt;guardrails.yaml&lt;/code&gt; was one of the best decisions. Tuning confidence scoring or contradiction detection doesn't require code changes — just a YAML edit and a deploy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It / Read the Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live demo:&lt;/strong&gt; &lt;a href="https://sec-intelligence-system.vercel.app" rel="noopener noreferrer"&gt;sec-intelligence-system.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/bhattaraisubal-eng/sec-intelligence-system" rel="noopener noreferrer"&gt;github.com/bhattaraisubal-eng/sec-intelligence-system&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coverage:&lt;/strong&gt; AAPL, MSFT, NVDA, AMZN, GOOGL, META, BRK-B, LLY, AVGO, JPM — 10-K and 10-Q filings from 2010 to present.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; FastAPI, PostgreSQL + pgvector, OpenAI text-embedding-3-small, GPT-4o-mini, cross-encoder/ms-marco-MiniLM-L-6-v2, React + Tailwind CSS&lt;/p&gt;

&lt;p&gt;Some queries to try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What was Apple's revenue in 2023?"&lt;/li&gt;
&lt;li&gt;"How has NVIDIA revenue changed from 2020 to 2024?"&lt;/li&gt;
&lt;li&gt;"Compare net income AAPL vs MSFT 2023"&lt;/li&gt;
&lt;li&gt;"What are the key risk factors in Meta's latest 10-K?"&lt;/li&gt;
&lt;li&gt;"Show JPMorgan balance sheet for 2023"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building RAG systems over domain-specific data — especially data that mixes structured and unstructured information — I'd love to hear about your experience. The biggest thing I learned is that the domain shapes the architecture far more than the AI tools you choose.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you found this useful, I'm happy to answer questions in the comments about any specific design decision. The &lt;a href="https://github.com/bhattaraisubal-eng/sec-intelligence-system/tree/main/docs" rel="noopener noreferrer"&gt;detailed architecture docs&lt;/a&gt; in the repo go even deeper into each component.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
