Building a PDF Parser for Financial Data: Lessons from Arbiter V2

I’m Matthew, building Arbiter Briefs — an AI engine that helps founders make high-stakes decisions. This week we shipped financial PDF ingestion, and I want to walk through the architecture, the gotchas, and why we chose regex over ML for extraction.
The Problem
Our v1 was generating rulings based on web research + user input. But founders kept saying the same thing: “This would be way more useful if you actually read my financial data.”
So we added PDF upload. But now we had a new problem: how do you reliably extract structured financial metrics from PDFs that could be formatted a hundred different ways?
We could’ve gone full ML pipeline. Instead, we went pragmatic.

Architecture Overview

PDF Upload (multer) → Storage (Railway volume) → Parse (pdf-parse) → Extract (regex + heuristics) → Store (PostgreSQL JSONB) → Use in Ruling (context injection)

Simple. Async. Testable.

Step 1: Upload (Multer)
We use multer for file handling — it’s simple, battle-tested, and handles multipart form data without fuss.

Upload constraints:
• Max 10MB per file (covers P&Ls, balance sheets, cap tables)
• Max 5 files per analysis (prevents abuse)
• Only PDF files accepted
• In-memory buffer (files are saved to disk immediately after)

Why these limits?
• 10MB keeps parsing under 5 seconds
• 5 files per analysis is enough context without overwhelming the system
• Railway volumes can handle it without quota issues
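
For concreteness, here's a minimal sketch of how these constraints could be wired into multer (the error message is illustrative, and the exact route attachment is up to you):

```typescript
import multer from "multer";

// Enforce the upload constraints at the middleware layer
export const upload = multer({
  storage: multer.memoryStorage(), // keep the buffer in memory; we persist to disk ourselves
  limits: {
    fileSize: 10 * 1024 * 1024, // 10MB per file
    files: 5,                   // max 5 files per analysis
  },
  fileFilter: (_req, file, cb) => {
    // Accept PDFs only; everything else is rejected before it hits a route
    if (file.mimetype === "application/pdf") cb(null, true);
    else cb(new Error("Only PDF files are accepted"));
  },
});

// Attach per-route: app.post("/upload", upload.array("files", 5), handler)
```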

Step 2: Storage (Railway Persistent Volume)
We’re on Railway. Persistent volumes are simple: you mount a folder at /app/uploads, and files survive deploys.
Folder structure: /uploads/{userId}/{analysisId}/{uuid}-filename.pdf

This approach:
• Keeps files organized and private (users can’t enumerate each other’s documents)
• Makes cleanup easy (delete an analysis folder, files are gone)
• Survives deploys without S3 complexity
Why not S3?
• We’re pre-launch. S3 adds cost (~$0.023/GB/month) and infrastructure overhead
• Railway volumes are free up to 5GB
• We can migrate to S3 in 30 minutes when we hit scale
Why this folder structure?
• Privacy isolation: each user’s files are in their own path
• Easy multi-tenant if we ever need it
• Simple to debug (“ls /uploads/{userId}/{analysisId}” shows what’s there)
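
A quick sketch of what writing into that structure might look like (the helper name is ours; it assumes Node's built-in fs, crypto, and path modules):

```typescript
import { mkdir, writeFile } from "node:fs/promises";
import { randomUUID } from "node:crypto";
import path from "node:path";

const UPLOAD_ROOT = "/app/uploads"; // Railway volume mount point

// Save a buffer to /uploads/{userId}/{analysisId}/{uuid}-filename.pdf
async function saveUpload(
  userId: string,
  analysisId: string,
  originalName: string,
  buffer: Buffer
): Promise<string> {
  const dir = path.join(UPLOAD_ROOT, userId, analysisId);
  await mkdir(dir, { recursive: true }); // no-op if the folder already exists

  // basename() strips any path segments a hostile client put in the filename
  const filePath = path.join(dir, `${randomUUID()}-${path.basename(originalName)}`);
  await writeFile(filePath, buffer);
  return filePath;
}
```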

Step 3: Parse (pdf-parse Library)
We use pdf-parse (npm package) to extract text and metadata from PDFs. It handles the heavy lifting — text extraction, page count, embedded metadata.

Why pdf-parse?
• Lightweight (~50KB, no external dependencies)
• Fast (parses a 20-page PDF in <1 second)
• Good enough for searchable PDFs
Caveat: pdf-parse struggles with:
• Scanned PDFs (images, not text)
• Heavily formatted tables (the text comes out, but row/column structure is lost)
• Non-standard encodings

For Alpha 2, we’re assuming users upload searchable PDFs. If we get complaints, we’ll upgrade to a heavier library or integrate GPT-4o’s PDF API.
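
The parse step itself is only a few lines. A sketch (pdf-parse's promise API is real; the wrapper is ours):

```typescript
import { readFile } from "node:fs/promises";
import pdf from "pdf-parse"; // CommonJS module; needs esModuleInterop in tsconfig

// Pull raw text and basic metadata out of a stored PDF
async function parsePdf(filePath: string) {
  const buffer = await readFile(filePath);
  const data = await pdf(buffer);
  return {
    text: data.text,      // full extracted text
    pages: data.numpages, // page count
    info: data.info,      // embedded metadata (Title, Author, Producer, ...)
  };
}
```
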
Step 4: Extract (Regex + Heuristics)

We detect document type first (P&L vs. balance sheet vs. cap table) by scanning for keyword signals. Then we extract relevant metrics using regex patterns.

Document type detection:
Look for keywords like “profit and loss” → P&L, “balance sheet” → balance sheet, “cap table” → cap table. If we hit 2+ keywords for a type, that’s our label.
Metric extraction:

Once we know the type, we target specific line items:
• P&L: Revenue, COGS, gross profit, operating expenses, EBITDA, net income
• Balance Sheet: Total assets, cash, liabilities, equity, debt
• Cap Table: Share classes, fully diluted, option pool

We also extract all dollar amounts ($1.2M, $1,234,567, $2B, etc.) and store them as detected numbers.
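
A simplified sketch of both passes. The keyword lists and patterns here are illustrative, not our production set:

```typescript
type DocType = "p&l" | "balance_sheet" | "cap_table" | "unknown";

// Illustrative keyword signals per document type
const SIGNALS: Record<Exclude<DocType, "unknown">, string[]> = {
  "p&l": ["profit and loss", "income statement", "cost of goods sold", "net income"],
  balance_sheet: ["balance sheet", "total assets", "total liabilities", "shareholders' equity"],
  cap_table: ["cap table", "capitalization table", "fully diluted", "option pool"],
};

// First type to reach 2+ keyword hits wins in this sketch
function detectDocType(text: string): DocType {
  const haystack = text.toLowerCase();
  for (const [type, keywords] of Object.entries(SIGNALS)) {
    const hits = keywords.filter((k) => haystack.includes(k)).length;
    if (hits >= 2) return type as DocType;
  }
  return "unknown";
}

// Match $1.2M, $1,234,567, $2B, etc., and normalize to plain numbers
const MONEY = /\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d+)?)\s?([KMB])?/gi;
const MULTIPLIER: Record<string, number> = { K: 1e3, M: 1e6, B: 1e9 };

function extractAmounts(text: string): number[] {
  return [...text.matchAll(MONEY)].map(([, num, suffix]) => {
    const base = parseFloat(num.replace(/,/g, ""));
    return suffix ? base * MULTIPLIER[suffix.toUpperCase()] : base;
  });
}
```
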
Why regex instead of ML?
• Speed: Regex runs in milliseconds. ML models take seconds.
• Cost: We’re pre-launch. No API spend yet.
• Simplicity: One founder + one engineer can own it.
• Good enough: Catches ~80% of cases correctly.
The trade-off: Regex fails on non-standard formatting, scanned PDFs, and non-English text. Week 4 plan: Feed extracted text through GPT-4o with a structured prompt to handle edge cases.

Step 5: Store (PostgreSQL JSONB)
We store extracted metrics in a financial_documents table with an extracted_data JSONB column. This gives us flexibility (new metrics don’t require migrations) and queryability (can index on specific fields).

What extracted data looks like:

{
  "documentType": "p&l",
  "keyMetrics": {
    "revenue": 2400000,
    "cogs": 800000,
    "grossProfit": 1600000,
    "ebitda": 400000
  }
}

Why JSONB?
• Flexible schema (add new metrics without schema changes)
• Queryable (can build reports filtering on specific extracted values)
• Easy to version (old data remains valid when extraction logic evolves)
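
With node-postgres, storing and querying that JSONB is straightforward. A sketch (the table and column names come from above; the status column is described in the next step):

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

// Persist extraction results for a parsed document
async function saveExtraction(documentId: string, extracted: object) {
  await pool.query(
    `UPDATE financial_documents
        SET extracted_data = $1::jsonb, status = 'parsed'
      WHERE id = $2`,
    [JSON.stringify(extracted), documentId]
  );
}

// Query straight into the JSONB, e.g. every P&L with >$1M extracted revenue
async function highRevenuePnls() {
  const { rows } = await pool.query(
    `SELECT id
       FROM financial_documents
      WHERE extracted_data->>'documentType' = 'p&l'
        AND (extracted_data->'keyMetrics'->>'revenue')::numeric > 1000000`
  );
  return rows;
}
```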

Step 6: Async Parsing
Critical architecture decision: Parsing happens asynchronously in the background.

When a user uploads a PDF:
1. File gets saved to disk immediately
2. We insert a “pending” record in the database
3. Return 201 Created to the frontend in ~200ms
4. Parsing runs in the background (takes 5–10 seconds)
5. Frontend polls every 3 seconds to check status
6. When parsing completes, status badge updates from “Pending” → “Parsed” or “Failed”

Why async?
• Upload returns fast (good UX)
• Parsing doesn’t block other requests
• User doesn’t wait for slow PDFs
• If parsing fails, user can click “Retry”
This approach is essential for any file-based feature. Blocking the response on parsing would run into typical 30-second request timeouts and frustrate users.
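
Putting the earlier sketches together, the upload handler looks roughly like this. req.userId is assumed to come from auth middleware, and extract() stands in for the detect + regex pass from Step 4:

```typescript
app.post(
  "/api/analyses/:analysisId/documents",
  upload.array("files", 5),
  async (req, res) => {
    const files = (req.files ?? []) as Express.Multer.File[];
    const documentIds: string[] = [];

    for (const file of files) {
      // 1. Save to disk immediately
      const filePath = await saveUpload(
        (req as any).userId, // assumed to be set by auth middleware
        req.params.analysisId,
        file.originalname,
        file.buffer
      );

      // 2. Insert a "pending" record
      const { rows } = await pool.query(
        `INSERT INTO financial_documents (analysis_id, file_path, status)
         VALUES ($1, $2, 'pending') RETURNING id`,
        [req.params.analysisId, filePath]
      );
      documentIds.push(rows[0].id);

      // 4. Parse in the background. No await: the response never waits on this.
      parsePdf(filePath)
        .then(({ text }) => saveExtraction(rows[0].id, extract(text))) // extract() = hypothetical detect + regex pass
        .catch(() =>
          pool.query(
            `UPDATE financial_documents SET status = 'failed' WHERE id = $1`,
            [rows[0].id]
          )
        );
    }

    // 3. Respond fast; the frontend polls for status changes
    res.status(201).json({ documentIds });
  }
);
```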

Frontend (React)
On the frontend, users see:
• Drag-and-drop zone for uploading PDFs
• Status badges for each document (Pending → Parsed or Failed)
• Retry button if parsing fails
• Delete button to remove a document

The UI polls the backend every 3 seconds while any document is “Pending”. As parsing completes, badges update in real time. No page refresh needed.
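
A minimal version of that polling as a React hook (the endpoint and response shape are illustrative):

```typescript
import { useEffect, useState } from "react";

type Doc = { id: string; name: string; status: "pending" | "parsed" | "failed" };

// Poll document statuses every 3s while anything is still pending
function useDocumentPolling(analysisId: string) {
  const [docs, setDocs] = useState<Doc[]>([]);

  useEffect(() => {
    let cancelled = false;
    let timer: ReturnType<typeof setTimeout> | undefined;

    const poll = async () => {
      const res = await fetch(`/api/analyses/${analysisId}/documents`);
      const next: Doc[] = await res.json();
      if (cancelled) return;
      setDocs(next);
      // Reschedule only while something is still pending
      if (next.some((d) => d.status === "pending")) {
        timer = setTimeout(poll, 3000);
      }
    };

    poll();
    return () => {
      cancelled = true;
      if (timer) clearTimeout(timer);
    };
  }, [analysisId]);

  return docs;
}
```

Recursive setTimeout (rather than setInterval) avoids overlapping requests and stops cleanly once nothing is pending.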

It’s simple. Minimal. Exactly what you need.

What We Learned
1. Start with regex. It’s fast, debuggable, and good enough for MVP. Upgrade to ML when you have clear signals it’s failing.
2. Async is essential. If parsing blocks the response, your UX suffers. Let it run in background.
3. JSONB is your friend. Don’t try to normalize financial data into relational tables. Store as JSON, query as needed.
4. Test with real PDFs early. Every PDF format is slightly different. Our regex catches ~80% of P&Ls correctly. The other 20% need manual tweaks or GPT-4o.
5. Storage matters at scale. Railway volumes are great for <5GB. If you grow past that, migrate to S3 preemptively.
What’s Next
• Week 4: Replace regex with GPT-4o structured extraction (handles edge cases, learns from failures)
• Week 5–6: Financial modeling (sensitivity analysis using extracted metrics)
• Week 7: MiroFish integration (stakeholder simulation)
• Week 8: Visual graphs (tornado charts, waterfall charts)

If you’re building something similar, happy to answer questions in the comments. And if you’re a founder facing high-stakes decisions, we’re building the tool for you. Early access: arbiterbriefs.com
