<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matthew Karaula</title>
    <description>The latest articles on DEV Community by Matthew Karaula (@karamatt_).</description>
    <link>https://dev.to/karamatt_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3870896%2F694b2914-e442-40b8-8150-719fdf1e96fb.png</url>
      <title>DEV Community: Matthew Karaula</title>
      <link>https://dev.to/karamatt_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karamatt_"/>
    <language>en</language>
    <item>
      <title>How I Rebuilt My AI Decision Tool From a Summarizer Into a Constraint-Driven Arbitrator</title>
      <dc:creator>Matthew Karaula</dc:creator>
      <pubDate>Fri, 10 Apr 2026 04:47:49 +0000</pubDate>
      <link>https://dev.to/karamatt_/how-i-rebuilt-my-ai-decision-tool-from-a-summarizer-into-a-constraint-driven-arbitrator-5fc7</link>
      <guid>https://dev.to/karamatt_/how-i-rebuilt-my-ai-decision-tool-from-a-summarizer-into-a-constraint-driven-arbitrator-5fc7</guid>
      <description>&lt;p&gt;A few weeks ago, I shipped a tool called Arbiter that takes a business decision, runs it through GPT-4o, and returns a structured analysis. The output looked impressive. Recommendation, confidence score, pros and cons, risk ratings, next steps. Everything you'd expect from an AI decision tool.&lt;/p&gt;

&lt;p&gt;Then I posted it on Reddit and got destroyed in the comments.&lt;br&gt;
Not because the output was wrong. Because the output was vague. One commenter pointed out that the AI was just hand-waving its way to a conclusion. Another asked how it handled contradictory evidence between different perspectives. A third said the confidence scores felt arbitrary: there was no mechanism that would actually drop confidence when the evidence was weak.&lt;/p&gt;

&lt;p&gt;They were right. I was running a single LLM call with a clever prompt and pretending it was decision intelligence.&lt;/p&gt;

&lt;p&gt;This post is about how I rebuilt the pipeline to actually adjudicate decisions instead of summarizing them, and the architectural decisions that made the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The original and how it failed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first version was simple. One system prompt, one user prompt, one JSON response.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;User input → GPT-4o (with structured prompt) → JSON output&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The prompt asked the model to play "senior strategy analyst," analyze options, return pros and cons, and assign a confidence score. It worked in the sense that it produced reasonable-looking output. It failed in three specific ways.&lt;/p&gt;

&lt;p&gt;First, the model could justify any conclusion with confident-sounding prose. There was no internal mechanism forcing it to actually weigh evidence; it just had to sound like it did.&lt;/p&gt;

&lt;p&gt;Second, confidence scores were cosmetic. The model would output 85% confidence on a vague decision and 75% on a well-defined one, with no consistent logic. I couldn't trace where the score came from.&lt;/p&gt;

&lt;p&gt;Third, when the same decision was run twice, the recommendations would sometimes flip. A single LLM call has no internal debate mechanism; whichever framing the model latched onto first won.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The redesign: separating extraction from advocacy from adjudication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core insight was that real decision-making isn't a single act of reasoning. It's at least three distinct cognitive operations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defining what success looks like&lt;/strong&gt; (constraints, criteria, non-negotiables)&lt;br&gt;
&lt;strong&gt;Building the strongest case for each option&lt;/strong&gt; (advocacy)&lt;br&gt;
&lt;strong&gt;Evaluating each case against the success criteria&lt;/strong&gt; (adjudication)&lt;/p&gt;

&lt;p&gt;A single LLM call was trying to do all three at once, which is why it could rationalize any answer. The fix was to separate them into distinct stages where each stage's output became a hard input to the next. Here's the new pipeline:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;User input
  ↓
Stage 1: Constraint Extraction
  ↓
Stage 2: Research (with web search)
  ↓
Stage 3: Independent Advocates (parallel)
  ↓
Stage 4: Arbitrator
  ↓
Decision Brief&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each stage has a specific job, runs as its own LLM call with its own system prompt, and passes structured JSON to the next stage. Let me walk through what each one does and why it matters.&lt;/p&gt;
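&lt;p&gt;To make the hand-off concrete, here's a minimal sketch of the staged design in Python. This is an illustration, not Arbiter's actual code: run_llm_stage is a stub standing in for a real LLM call (the real version would send the system prompt plus the serialized payload to the model and parse the JSON it returns):&lt;/p&gt;

```python
def run_llm_stage(system_prompt, payload):
    """Stub standing in for one LLM call. It just echoes what it was
    given so the data flow between stages is visible."""
    return {"stage": system_prompt, "input_keys": sorted(payload)}

def run_pipeline(user_input):
    # Each stage's structured output becomes a hard input to the next.
    constraints = run_llm_stage("constraint-extraction", user_input)
    research = run_llm_stage("research", {"constraints": constraints, **user_input})
    # One advocate call per option (shown sequentially here for clarity;
    # the real pipeline runs them in parallel).
    advocates = [
        run_llm_stage("advocate", {"option": opt,
                                   "constraints": constraints,
                                   "research": research})
        for opt in user_input.get("options", [])
    ]
    return run_llm_stage("arbitrator", {"constraints": constraints,
                                        "research": research,
                                        "advocates": advocates})
```

&lt;p&gt;The point of the shape is that the arbitrator only ever sees structured inputs produced upstream, never the raw user prose.&lt;/p&gt;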

&lt;p&gt;&lt;strong&gt;Stage 1: Constraint extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the biggest unlock. Before any reasoning happens, the system extracts a normalized constraint framework from the user's inputs.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "hard_constraints": [
    {"id": "HC1", "constraint": "Budget capped at $300K", "source": "user_input"}
  ],
  "soft_constraints": [
    {"id": "SC1", "constraint": "Minimize disruption to existing team", "weight": "high"}
  ],
  "decision_criteria": [
    {"id": "DC1", "criterion": "Operational within 4 months", "measurable": "go-live date"}
  ],
  "risk_tolerance": "moderate",
  "non_negotiables": ["No customer downtime"],
  "unknown_critical_inputs": ["Current team capacity"]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The point isn't the format. The point is that every downstream stage now references the same constraint IDs. When an advocate argues for an option, they have to explicitly show which constraints their option satisfies. When the Arbitrator scores options, it scores them against the same constraint set, not against free-form prose.&lt;/p&gt;

&lt;p&gt;This single change eliminated about 80% of the hand-waving. The model couldn't just say "this option seems best" anymore. It had to point to specific constraints and show how each one is satisfied.&lt;/p&gt;

&lt;p&gt;The other useful thing constraint extraction does is identify what the user didn't tell you. The unknown_critical_inputs field forces the model to flag missing information. That data later becomes input to the confidence calculation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Research with real web search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original tool relied entirely on training data for "industry context." The output looked authoritative but was completely ungrounded — citing statistics that may or may not exist, referencing competitor moves the model imagined.&lt;/p&gt;

&lt;p&gt;The fix was Tavily, a search API designed for LLM consumption. The Research Agent generates three focused search queries from the decision context, executes them in parallel, and synthesizes the results into structured findings.&lt;/p&gt;
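&lt;p&gt;The fan-out itself is plain thread-pool concurrency. A sketch, with search as a placeholder for the actual Tavily call (which issues an HTTP request per query):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def search(query):
    """Placeholder for a search-API call (e.g. Tavily). The real version
    would hit the network and return ranked results for the query."""
    return {"query": query, "results": []}

def run_searches(queries):
    # Execute the focused queries concurrently. pool.map preserves input
    # order, so results line up with the queries that produced them.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        return list(pool.map(search, queries))
```

&lt;p&gt;With three queries, wall-clock time is roughly one search round-trip instead of three.&lt;/p&gt;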

&lt;p&gt;The key design decision was how to handle uncertainty about source quality. Rather than pretending every claim is equally evidenced, every finding gets tagged:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "claim": "Australian SaaS NRR averaged 112% in Q4 2025",
  "evidence_strength": "high",
  "source_type": "cited",
  "source_url": "https://...",
  "source_title": "..."
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;source_type is one of cited, inference, or model_knowledge. evidence_strength is high, medium, or low. The rule baked into the prompt: a claim cannot be marked as high-strength evidence unless it has a real URL backing it.&lt;/p&gt;

&lt;p&gt;This sounds obvious, but it took multiple iterations to get the model to actually respect it. Models have a strong default behavior of confidently asserting things. Breaking that habit required restating the rule in three different places in the prompt and explicitly forbidding fabricated citations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Parallel advocates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each option the user provides, an advocate LLM call runs in parallel, building the strongest possible case. The system prompt instructs them to be persuasive but honest, and crucially:&lt;/p&gt;

&lt;p&gt;Your argument must be structured around the CONSTRAINTS defined by the decision analyst. You cannot hand-wave; you must explicitly show how your option satisfies each hard constraint, decision criterion, and key soft constraint.&lt;/p&gt;

&lt;p&gt;Each advocate returns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "option": "Option A",
  "executive_argument": "...",
  "constraint_satisfaction": [
    {"constraint_id": "HC1", "satisfied": "yes|partial|no", "reasoning": "..."}
  ],
  "supporting_evidence": [
    {"point": "...", "evidence_strength": "high|medium|low", "source_ref": "..."}
  ],
  "acknowledged_weaknesses": [...]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The acknowledged_weaknesses field matters. Without it, advocates produced suspiciously one-sided arguments. Forcing them to acknowledge their own option's weaknesses produced more honest output, and gave the Arbitrator material to work with in the next stage.&lt;/p&gt;

&lt;p&gt;Running advocates in parallel was an obvious win for latency. Three options means three concurrent LLM calls instead of three sequential ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: The Arbitrator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the real adjudication happens. The Arbitrator receives the constraint framework, the research findings, and all advocate arguments. Its system prompt explicitly tells it that its job is not to summarize:&lt;/p&gt;

&lt;p&gt;You are NOT summarizing the advocates. You are ADJUDICATING.&lt;br&gt;
Your process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Score each option against the constraints&lt;/li&gt;
&lt;li&gt;Identify contradictions between advocate arguments and resolve them with evidence&lt;/li&gt;
&lt;li&gt;Assess evidence strength for each advocate's claims&lt;/li&gt;
&lt;li&gt;Deliver a clear ruling. Do not hedge.&lt;/li&gt;
&lt;li&gt;Assess your own confidence based on constraint clarity, evidence quality, advocate agreement, and unknown critical inputs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output includes a constraint scorecard that maps every constraint to a pass/partial/fail rating per option, a list of contradictions between advocates with how they were resolved, sensitivity variables (concrete values that would flip the ruling), and the actual ruling itself.&lt;br&gt;
The most important field is certainty_rationale. The model has to explain why its confidence is what it is. This makes the score legible — you can see whether the 72% confidence comes from "strong evidence but advocates disagree" or "weak evidence but clear constraint winner." Two different stories that should produce different actions from the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this cost me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single LLM call with the original architecture was about $0.02 per analysis on GPT-4o-mini. The new pipeline runs six LLM calls (constraint extraction, research synthesis, three advocates, arbitrator) plus three Tavily searches. Cost per brief is now closer to $0.10 on the same model. Latency went from ~15 seconds to ~45 seconds.&lt;/p&gt;

&lt;p&gt;That's a 5x cost increase and 3x latency increase. For most consumer products it would be a bad trade. For a tool whose entire value proposition is "give me a structured ruling I can act on," it's worth it. Users will wait 45 seconds for output that actually helps them. They won't pay for output that looks like a ChatGPT response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three things from the rebuild that I'd apply to any multi-stage LLM system.&lt;/p&gt;

&lt;p&gt;Separation of concerns matters more than prompt engineering. I spent weeks trying to make a single prompt produce better output. Splitting that prompt into four prompts, each with a narrow job, did more in a day than the prompt tweaks did in two weeks. Each stage gets to specialize. Each stage's output becomes a hard constraint for the next stage instead of a suggestion.&lt;/p&gt;

&lt;p&gt;Models will fabricate confidence unless you make confidence expensive. The original tool happily output 90% confidence because nothing in the prompt punished it for being overconfident. The new tool ties certainty to specific factors (evidence strength, advocate agreement, missing inputs) and forces the model to justify its score in writing. When the model has to explain its confidence, it gets more conservative.&lt;/p&gt;

&lt;p&gt;Adversarial structure produces better reasoning than collaborative structure. The original prompt asked the model to "consider all perspectives." The new architecture has independent advocates each arguing their case, then a neutral arbitrator weighing them against criteria. The adversarial setup produces sharper arguments because each advocate is incentivized to make the strongest case. The arbitrator then has real material to weigh instead of mush.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here are screenshots of a real Decision Brief from the new pipeline. The constraint scorecard at the top is the most visually distinctive thing — every option scored against every extracted constraint. Below it, the research section shows cited findings with evidence strength badges and clickable source URLs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiit39zdon26mgpqr1vae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiit39zdon26mgpqr1vae.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83481rx3vj3iw7oggg44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83481rx3vj3iw7oggg44.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pipeline still has weak spots. Constraint extraction is fragile when user inputs are sparse: garbage in, garbage out. I'm working on a constraint review step where the user can edit the extracted framework before advocates run. Evidence strength calibration is also conservative; the model defaults to "medium" for almost everything unless there's a clearly cited stat. I'm experimenting with explicit calibration examples in the prompt.&lt;/p&gt;

&lt;p&gt;If you want to play with the tool, it's at &lt;a href="https://arbiter-frontend-iota.vercel.app/" rel="noopener noreferrer"&gt;https://arbiter-frontend-iota.vercel.app/&lt;/a&gt;. Free tier gives you a few briefs per month, no credit card. Genuinely interested in feedback on where the pipeline breaks for your use case.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
