Rishabh Gupta

Posted on Jun 10 • Originally published at rishabh.fyi on Jun 8

How I Built a RAG System on the SpaceX S-1 in One Weekend

#aisystemdesign

SpaceX filed a 389-page S-1 on May 20, 2026. I read the news, opened the SEC EDGAR filing, and immediately hit the same wall everyone hits — 389 pages of dense legal and financial disclosure, no search, no way to ask a direct question and get a cited answer.

The summaries floating around were useful for headlines. Useless for anything specific. "SpaceX is profitable" tells you nothing about which segments are driving it, what the margin trajectory looks like, or what governance risks the company is flagging. For that, you need the actual text, with a page reference you can verify.

So I built AskS1.com. Here's what that actually involved — including the parts that didn't work.

Why RAG, Not Just Upload to Claude

The obvious approach is uploading the PDF to Claude or ChatGPT and asking questions. It works, mostly. But it has three problems.

First, the SpaceX S-1 was filed after most model training cutoffs. For specific figures the model has no training data — it either says "I don't know" or hallucinates a plausible number. I benchmarked this: asking Claude directly about SpaceX's 2025 revenue without context produces a confident wrong answer.

Second, a 395-page document (after amendments) strains context windows. Models start losing details from the middle of the document when they're trying to hold everything at once. Important disclosures on pages 80-200 get deprioritized for content near the beginning and end.

Third, citations are vague. "According to the filing" isn't useful when you're trying to verify a specific governance claim before an IPO.

RAG solves all three. You precompute the embeddings once, retrieve only the relevant chunks at query time, and the model sees focused context rather than 395 pages of noise.

Why Not Fine-Tune

Before settling on RAG, I considered fine-tuning a smaller model on the filing content. The results from my own benchmarking — fine-tuning Mistral-7B on 25 SpaceX Q&A pairs — ruled it out quickly.

Fine-tuning on a document teaches the model to reproduce facts it has seen during training. Ask it a question that maps closely to a training example and it answers well. Ask it anything slightly outside that distribution — a follow-up question, a cross-reference between sections, a question phrased differently — and it hallucinates confidently. The model has memorized, not understood.

RAG sidesteps this entirely. The model never sees the filing during training. At query time, relevant chunks are retrieved and injected as context. The model reads those chunks and answers from them. It's closer to open-book exam than memorization — and for a 395-page legal document with dense cross-references, open-book is the right approach.

Fine-tuning also has a practical problem for this use case: when SpaceX files an amendment — which they did twice within two weeks — the fine-tuned model is immediately stale. Re-ingesting a RAG pipeline takes under 5 minutes. Re-fine-tuning a model takes hours and compute budget.

The Architecture

SpaceX S-1 PDF (395 pages → 871 chunks)
    ↓ pdfplumber — extract text page by page
    ↓ sliding window chunker — 400 words, 100 overlap
    ↓ all-MiniLM-L6-v2 — embed chunks → 384-dim vectors
    ↓ Qdrant Cloud — store 871 vectors + page metadata

User question
    ↓ all-MiniLM-L6-v2 — embed query
    ↓ cosine similarity → top 15 candidates
    ↓ re-rank — penalize summary pages
    ↓ Claude Haiku — generate cited answer
    ↓ ±8 page range citation

Four components. Each does one thing.

Two separate models — intentional design. all-MiniLM-L6-v2 handles embeddings only. Claude Haiku handles generation only. Embedding models are optimized for semantic similarity — small, fast, deterministic, 384 dimensions. Generation models are optimized for instruction following and text quality. Using the same model for both would mean either a slow embedding step or a weak generation step. Keeping them separate is standard RAG practice and worth being explicit about.

Why Qdrant. Qdrant's free tier is generous enough for a single filing (871 chunks, 384 dimensions). The HNSW index makes similarity search fast at this scale. Local Qdrant works for development — Qdrant Cloud for production without managing infrastructure.

395 pages → 871 chunks. Average 2.2 chunks per page after the sliding window. Total vectors stored: 871 × 384 dimensions. Each chunk stores text, page number, and end page in the payload — retrieved alongside the vector for citation generation.

The Chunking Decision

400 words per chunk with 100-word overlap. Why these numbers?

Smaller chunks (200 words) lose context for multi-sentence financial disclosures. A revenue figure appears on one line; the explanation — segment breakdown, YoY comparison, key drivers — spans the next five sentences. Split at 200 words, you retrieve the number without the context.

Larger chunks (800 words) reduce retrieval precision. You retrieve more text than you need and dilute the relevant signal with adjacent content.

The 100-word overlap ensures no fact gets cut at a chunk boundary without appearing in an adjacent chunk. Any sentence that spans two chunks will be fully retrievable from either side.

Why Claude Haiku for Generation

I benchmarked five models on 15 SpaceX S-1 questions — factual recall, multi-step reasoning, and structured output — with RAG context injected each time:

Model	Overall	Latency
Claude Haiku	4.7/5	2.8s
phi4:14b (local)	4.5/5	27.6s
qwen2.5:14b (local)	4.4/5	26.9s
mistral:7b (local)	4.4/5	9.0s
deepseek-r1:14b (local)	4.3/5	102.8s

The quality gap between Haiku and local 14B models is 0.2 points. The latency gap is 10x. For a web product where users are waiting for an answer, Haiku wins decisively.

One interesting finding: structured output scores were nearly identical across all models (4.4-4.6). The differentiation came entirely from factual accuracy and reasoning — where Haiku's training data and instruction following consistently outperformed locally-run open models.

The Challenges

The summary pages problem.

The executive summary (pages 1-24) mentions every major topic at a high level — consistently scoring highest in semantic similarity for almost any query, even when detailed content existed 100+ pages later.

Fix: retrieve 15 candidates, then apply a 0.15 penalty to chunks from pages under 25. Most substantive disclosures live deeper in the filing. Penalizing the summary section keeps retrieval focused on the narrative sections where specific claims and governance details actually appear.

The page citation problem.

The most challenging aspect was generating accurate page citations. The core issue: the SEC EDGAR filing only exists as HTML, which I converted to PDF using Chrome's print function. Chrome's HTML reflow during rendering means the text layer in the PDF doesn't always align with what you see visually.

What I tried first — standalone number regex

The first attempt looked for standalone numbers at the bottom of each page. Failed immediately — financial tables, footnote numbers, and reference counts appear throughout the page content including near the bottom. Too many false positives to be reliable.

What I tried second — Chrome's N/313 footer regex

Chrome adds N/313 page indicators in the footer during printing. I wrote a regex to extract it.

In theory this pattern is unique and can't appear elsewhere in the filing. In practice it was unreliable — the footer text wasn't always cleanly captured by pdfplumber's text extraction, so the regex frequently missed pages.

What I tried third — WeasyPrint HTML→PDF conversion

WeasyPrint converts HTML to properly paginated PDF where the text layer and visual layer are aligned by design. This would have eliminated the problem entirely. Failed on macOS — requires GTK libraries (libgobject, pango, cairo) that don't install cleanly on macOS without significant dependency management. Abandoned after an hour of dependency hell.

What I tried fourth — paged.js

A JavaScript library specifically designed for CSS-based HTML pagination. More macOS-friendly than WeasyPrint. The 11.8MB HTML filing with separately hosted image assets made this impractical — the converted PDF would be missing all images and the pagination would differ from the original rendering anyway.

What actually works — position-based extraction

The winning approach uses pdfplumber's coordinate system directly. Instead of parsing text, it looks for a standalone digit in the bottom 10% of the page, centered between 20–80% of the page width.

This reliably catches the printed page number without depending on text extraction of footer lines. Citations display a ±8 page range to account for any remaining rendering uncertainties.

Demo card caching

The /api/demo route is intentionally cached by Next.js. The three demo questions are fixed, the underlying data doesn't change between ingestion runs, and the answers are expensive to generate — hitting both Qdrant and the Claude API on every page load would add latency for no benefit. Cached results mean the landing page loads fast every time.

The filing is a moving target.

SpaceX filed two amendments after the original S-1 — S-1/A #1 on June 1 and S-1/A #2 on June 3 — with updated financials and the IPO price range ($135/share). The RAG pipeline re-ingests any filing version in under 5 minutes. When Anthropic and OpenAI file their S-1s later this year, the same pipeline handles them.

Conversation Memory

The app maintains conversation history across turns. Follow-up questions work without re-explaining context — "which segment is most profitable?" after asking about revenue breakdown uses the prior exchange. History is passed as the Anthropic messages array, capped at the last 10 exchanges to keep context window usage bounded.

Stack

Frontend: Next.js 14 on Railway. Migrated from a Streamlit prototype — easier to keep the same platform than migrate.
Vector storage: Qdrant Cloud. Free tier covers a single filing comfortably. HNSW index, no infrastructure to manage.
Generation: Anthropic API (Claude Haiku). Chosen on latency and quality benchmarks above.
Embeddings: @xenova/transformers running all-MiniLM-L6-v2 in Node.js. Runs entirely locally — no embedding API calls at query time, which reduces latency and cost per query. Ingestion is separated from retrieval; embeddings are computed once and pushed to Qdrant Cloud.
Domain: Cloudflare. AskS1.com at ~$10/year.

What's Next

Anthropic and OpenAI S-1s are expected soon. AskS1 will be there when they file.

DEV Community

How I Built a RAG System on the SpaceX S-1 in One Weekend

Top comments (0)