Parsing PDFs sounds easy until you try parsing bank statements.
I learned this the hard way.
I spent nearly 2 months building a Chase Bank PDF parser that reaches 99% accuracy across 23 real statements (1,123 transactions total). Meanwhile, generic converters like Tabula or PDFTables only hit ~70% on the same documents.
Here’s why Chase PDFs are much harder than you think—and how I solved the problems using TypeScript and pdfjs-dist, with real code you can copy.
⸻
Introduction
If you’ve ever worked with U.S. banking data, you know that Chase Bank does something strange:
They only let you download the last 18 months of transactions as CSV.
CPAs, bookkeepers, and backend engineers quickly hit a wall when they need 5+ years of historical data. Chase provides those older statements only as PDFs—and the PDFs are absolutely not designed for machine parsing.
Most accountants spend 45–60 minutes manually retyping each statement into QuickBooks or Excel.
Most developers try using generic PDF converters… and then discover that bank statements are in the top 1% of “PDFs that look structured but absolutely aren’t.”
I wanted to solve this in code.
In this article, you’ll learn:
• Why Chase PDFs are so uniquely hard to parse
• How structure-based format detection beats year-based detection
• How to infer column positions when the PDF has no headers
• How to merge split dates from fragmented PDF text items
• Real TypeScript code using pdfjs-dist
• Accuracy results from 23 real PDFs (2015–2025)
This is the article I wish existed before I started.
⸻
Part 1: Why Generic PDF Converters Fail on Bank Statements
After testing every major converter (PDFTables, Tabula, SmallPDF), I discovered four structural issues that make Chase PDFs uniquely problematic.
Let’s break them down.
⸻
Challenge 1: Multiple formats inside the SAME year
Chase used two formats simultaneously in 2024:
• v2 (2018–2024)
• v3 (2024–2025)
That means this detection method:
// ❌ WRONG: Year-based detection (breaks in 2024!)
function detectFormatWrong(year: number): 'v1' | 'v2' | 'v3' {
  if (year < 2018) return 'v1';
  if (year < 2024) return 'v2';
  return 'v3';
}
…works fine until you get a February 2024 statement in v2 format and a May 2024 statement in v3 format.
Generic converters assume document consistency.
Chase does not.
⸻
Challenge 2: Missing column headers
Some Chase PDFs—especially early-2022 Business Checking—contain no column labels at all.
Just raw rows:
02/01 AMAZON PAYMENT $1,250.00 $15,840.32
No:
• DATE
• DESCRIPTION
• AMOUNT
• BALANCE
Generic table extractors rely on headers. Without them, they completely collapse.
⸻
Challenge 3: Variable column positions
Typical fixed-width parsers assume:
DATE DESC AMOUNT BALANCE
But Chase PDFs vary:
• DATE X position: anywhere from 30 to 70 pixels
• AMOUNT column: sometimes 2nd from right, sometimes 3rd
• BALANCE column: right-aligned but with different indentation per statement
• DESCRIPTION: can shift 40–80 pixels depending on layout
You cannot rely on static pixel positions. You must infer structure dynamically.
⸻
Challenge 4: Split dates across text items
PDF.js may return:
"0"
"2"
"/01"
instead of one "02/01".
Why?
Because Chase stores each glyph separately in the PDF.
Generic converters treat these as separate columns and produce output like:
0, 2, /01, AMAZON, PAYMENT, $1250.00
When fixed:
02/01, AMAZON PAYMENT, $1250.00
⸻
Real example:
❌ Generic PDF converter:
Row 1: 0, 2, /01, AMAZON PAYMENT, $1,250.00, ???
✅ After merging + heuristics:
Row 1: 02/01, AMAZON PAYMENT, $1,250.00, $15,840.32
⸻
Accuracy comparison (23 real PDFs):
| Tool | Accuracy | Correct | Wrong |
| --- | --- | --- | --- |
| Generic converters | ~70% | 802 | 321 |
| Custom parser (pdfjs + TS) | 99% | 1,112 | 11 |
That’s 310 fewer errors across the same 23 statements.
⸻
Part 2: Solution — Structure-Based Format Detection
The key insight:
Don’t detect PDF format by year. Detect it by TEXT SIGNATURES.
Chase formats have unique structural markers.
Once you read the full extracted text, you can reliably detect formats.
⸻
The 3 Chase formats
| Format | Years | Columns | Structure | Unique Signature |
| --- | --- | --- | --- | --- |
| v1 | 2015–2017 | 3 cols | Simple list | No section headers |
| v2 | 2018–2024 | 4 cols | Transaction table | "TRANSACTION DETAIL" |
| v3 | 2024–2025 | 3 cols | Grouped by category | "DEPOSITS AND ADDITIONS" + "TOTAL DEPOSITS" |
⸻
Year-based detection (WRONG)
// ❌ breaks immediately in 2024
function detectFormatWrong(year: number): 'v1' | 'v2' | 'v3' {
  if (year < 2018) return 'v1';
  if (year < 2024) return 'v2';
  return 'v3';
}
⸻
Structure-based detection (CORRECT)
// ✅ CORRECT: Structure-based detection
function detectChaseFormat(fullText: string): 'v1' | 'v2' | 'v3' {
  // Priority 1: Check for v2 signature
  if (fullText.includes('TRANSACTION DETAIL')) {
    return 'v2';
  }

  // Priority 2: Check for v3 signature
  if (fullText.includes('DEPOSITS AND ADDITIONS') &&
      fullText.includes('TOTAL DEPOSITS')) {
    return 'v3';
  }

  // Priority 3: Year-based fallback for old v1 format
  const year = extractStatementYear(fullText);
  if (year && year < 2018) {
    return 'v1';
  }

  // Default: assume v2
  return 'v2';
}
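One piece not shown above is extractStatementYear. A minimal sketch could look like the following; the MM/DD/YYYY regex is my assumption about what appears in the statement-period line, not something Chase documents:

function extractStatementYear(fullText: string): number | null {
  // Grab the year from the first full date such as "01/31/2016".
  // The pattern is an assumption for illustration, not Chase's exact wording.
  const match = fullText.match(/\b\d{2}\/\d{2}\/(\d{4})\b/);
  return match ? parseInt(match[1], 10) : null;
}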
⸻
Why this works
• v2 always includes "TRANSACTION DETAIL"
• v3 always includes "DEPOSITS AND ADDITIONS" and "TOTAL DEPOSITS"
• v1 has none of these markers, so year fallback is safe
• Adding future formats becomes trivial: just add a new signature check at the top of the list (see the sketch below)
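To make that concrete, here is a purely hypothetical extension. The 'v4' label and its 'ACCOUNT ACTIVITY SUMMARY' marker are invented for illustration; they are not a real Chase format:

// Hypothetical only: 'v4' and its signature string are made up to show
// where a new check would slot in ahead of the existing logic.
function detectChaseFormatExtended(fullText: string): 'v1' | 'v2' | 'v3' | 'v4' {
  if (fullText.includes('ACCOUNT ACTIVITY SUMMARY')) {
    return 'v4';
  }
  return detectChaseFormat(fullText); // fall back to the v1/v2/v3 detection above
}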
⸻
Real-world validation
I tested all 23 PDFs:
• v1: 1 file
• v2: 15 files
• v3: 7 files
Detection accuracy: 23/23 (100%).
This approach also works for:
• Business Checking
• Personal Banking
• PDFs during format transition periods (e.g., April–July 2024)
⸻
Part 3: Heuristic Column Detection for PDFs with NO Headers
Some Chase PDFs simply omit headers altogether.
You must infer columns dynamically.
The solution:
Infer column positions from the first transaction row using date heuristics.
⸻
Step-by-step algorithm
1. Identify the first transaction row using a date pattern:
   • Look for MM/DD (e.g. 02/01)
   • In the X range 30–70 (Chase always puts dates on the left)
2. Extract all text items on the same horizontal row, using a Y-coordinate tolerance of ±5px.
3. Sort the items left-to-right by X.
4. Infer the column meaning:
   • leftmost → date
   • center → description
   • 2nd from right → amount
   • rightmost → balance
These rules held across every tested statement.
⸻
Code: Column detection
interface ColumnPositions {
  dateX: number;
  descX: number;
  amountX: number;
  balanceX: number;
}

function inferColumnPositions(
  textItems: PDFTextItem[]
): ColumnPositions {
  // Step 1: Find the first transaction row
  const firstDateItem = textItems.find(item =>
    /^\d{2}\/\d{2}$/.test(item.str) &&
    item.transform[4] >= 30 &&
    item.transform[4] <= 70
  );

  if (!firstDateItem) {
    throw new Error('Cannot find first transaction (no date pattern found)');
  }

  // Step 2: Extract row by Y position
  const dateY = firstDateItem.transform[5];
  const rowItems = textItems.filter(item =>
    Math.abs(item.transform[5] - dateY) < 5
  );

  // Step 3: Sort left-to-right
  const sortedByX = rowItems.sort((a, b) =>
    a.transform[4] - b.transform[4]
  );

  // Step 4: Infer from positions
  return {
    dateX: sortedByX[0].transform[4],
    descX: (sortedByX[0].transform[4] +
            sortedByX[sortedByX.length - 1].transform[4]) / 2,
    balanceX: sortedByX[sortedByX.length - 1].transform[4],
    amountX: sortedByX[sortedByX.length - 2].transform[4]
  };
}
⸻
Why this works
• Chase PDFs ALWAYS put the date on the far left
• Balance is ALWAYS right-aligned
• Description always occupies the middle
• Amount is consistently next to balance
This works even with:
• v1 (3 columns)
• v2 (4 columns)
• v3 (3 columns + grouped sections)
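For context, here is roughly how the textItems fed into inferColumnPositions come out of pdfjs-dist in the first place. The import path and the PDFTextItem shape are assumptions that depend on your pdfjs-dist version (the legacy build is the one usually used under Node); getDocument, getPage, and getTextContent are the standard API:

import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs'; // exact path varies by pdfjs-dist version

interface PDFTextItem {
  str: string;
  transform: number[]; // transform[4] is the x position, transform[5] is the y position
}

// Rough sketch: collect the positioned text items for one page.
async function extractTextItems(data: Uint8Array, pageNumber: number): Promise<PDFTextItem[]> {
  const pdf = await getDocument({ data }).promise;
  const page = await pdf.getPage(pageNumber);
  const content = await page.getTextContent();

  const items: PDFTextItem[] = [];
  for (const item of content.items) {
    // Skip marked-content entries; keep only items that carry text.
    if ('str' in item) {
      items.push({ str: item.str, transform: item.transform });
    }
  }
  return items;
}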
⸻
Part 4: Handling Split Dates
pdfjs-dist often splits glyphs into separate items.
Example raw output:
"0"
"2"
"/"
"0"
"1"
You must merge items by proximity.
⸻
Core idea:
If two items’ X positions differ < 15px, they’re part of the same text value.
This was empirically tested across 23 PDFs.
⸻
Code: Merging split date fragments
function mergeSplitDates(items: PDFTextItem[]): PDFTextItem[] {
  const merged: PDFTextItem[] = [];
  let buffer = '';
  let bufferX = 0;

  for (let i = 0; i < items.length; i++) {
    const item = items[i];
    const nextItem = items[i + 1];

    // Merge if close enough
    if (nextItem &&
        Math.abs(nextItem.transform[4] - item.transform[4]) < 15) {
      buffer += item.str;
      if (!bufferX) bufferX = item.transform[4];
    } else {
      merged.push({
        str: buffer + item.str,
        transform: [0, 0, 0, 0, bufferX || item.transform[4], item.transform[5]]
      });
      buffer = '';
      bufferX = 0;
    }
  }

  return merged;
}
⸻
Why 15px?
• < 10px missed some merges
• 20px caused accidental merges
• 15px was perfect across all documents
Result
❌ Before: ["0", "2", "/01", "AMAZON", "PAY", "MENT"]
✅ After: ["02/01", "AMAZON PAYMENT"]
You absolutely cannot build a reliable parser without this.
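A quick sanity check of the merge, using made-up coordinates in the shape pdfjs-dist returns (the PDFTextItem interface is the one sketched in Part 3):

// Illustrative fragments; the X/Y values are invented, not taken from a real statement.
const fragments: PDFTextItem[] = [
  { str: '0',   transform: [1, 0, 0, 1, 31, 640] },
  { str: '2',   transform: [1, 0, 0, 1, 37, 640] },
  { str: '/01', transform: [1, 0, 0, 1, 43, 640] },
  { str: 'AMAZON PAYMENT', transform: [1, 0, 0, 1, 120, 640] },
];

console.log(mergeSplitDates(fragments).map(item => item.str));
// → [ '02/01', 'AMAZON PAYMENT' ]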
⸻
Part 5: Tech Stack & Architecture
Here’s the stack that worked reliably.
⸻
Core technologies
pdfjs-dist
• Same engine Firefox uses
• Extracts precise text positions (X/Y)
• Supports PDF 1.4–2.0
• Lightweight compared to OCR (no 200MB Tesseract install)
TypeScript
• Needed for complex PDF item types
• Prevents 90% of runtime errors
• Great autocomplete for pdfjs API
Node.js
• Fast enough for server-side parsing
• Can run heavy parsing without blocking UI
Bull + Redis
• Parallel PDF processing
• Retry logic
• Failure handling that generic HTTP handlers lack
ExcelJS
• Generates QuickBooks-ready Excel output
• Supports proper data validation + number formats
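As a rough sketch of the Excel step (the sheet name, column layout, and number formats below are my assumptions for illustration, not a QuickBooks spec):

import ExcelJS from 'exceljs';

// Assumed row shape produced by the parsing stage.
interface ParsedTransaction {
  date: string;        // e.g. '02/01'
  description: string;
  amount: number;
  balance: number;
}

async function writeExcel(rows: ParsedTransaction[], outPath: string): Promise<void> {
  const workbook = new ExcelJS.Workbook();
  const sheet = workbook.addWorksheet('Transactions');

  sheet.columns = [
    { header: 'Date', key: 'date', width: 12 },
    { header: 'Description', key: 'description', width: 40 },
    { header: 'Amount', key: 'amount', width: 14, style: { numFmt: '#,##0.00' } },
    { header: 'Balance', key: 'balance', width: 14, style: { numFmt: '#,##0.00' } },
  ];

  rows.forEach(row => sheet.addRow(row));
  await workbook.xlsx.writeFile(outPath);
}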
⸻
System Architecture Flow
User uploads PDF
↓
Backend creates Bull job
↓
Worker parses PDF with pdfjs-dist
↓
Detect format (v1/v2/v3)
↓
Merge split dates
↓
Infer column positions
↓
Extract rows into normalized structure
↓
Generate final Excel file (ExcelJS)
↓
Return download URL
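Here is a compressed sketch of the worker side of that flow. It assumes the functions from earlier parts plus the extractTextItems and writeExcel sketches above, imported from hypothetical module paths; the queue name, job payload, retry policy, and the extractRows step are illustrative, not the production code:

import Queue from 'bull';
import { readFile } from 'node:fs/promises';
// Hypothetical module paths for the pieces shown earlier in this article.
import { detectChaseFormat, inferColumnPositions, mergeSplitDates, extractTextItems, extractRows } from './chase-parser';
import { writeExcel } from './excel-output';

interface ParseJob {
  filePath: string;   // where the uploaded PDF was stored
  outputPath: string; // where the Excel file should be written
}

const parseQueue = new Queue<ParseJob>('chase-pdf-parse', 'redis://127.0.0.1:6379');

parseQueue.process(async job => {
  const data = new Uint8Array(await readFile(job.data.filePath));

  // Only page 1 is shown for brevity; the real worker loops over every page.
  const rawItems = await extractTextItems(data, 1);
  const items = mergeSplitDates(rawItems);

  const fullText = items.map(item => item.str).join(' ');
  const format = detectChaseFormat(fullText);        // 'v1' | 'v2' | 'v3'
  const columns = inferColumnPositions(items);

  const rows = extractRows(items, columns, format);  // hypothetical row extractor
  await writeExcel(rows, job.data.outputPath);

  return { transactions: rows.length, format };
});

// Enqueued from the upload handler with retry settings (values are illustrative):
export function enqueueStatement(filePath: string, outputPath: string) {
  return parseQueue.add(
    { filePath, outputPath },
    { attempts: 3, backoff: { type: 'exponential', delay: 5000 } }
  );
}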
⸻
Performance
• Average PDF: 5 seconds
• Largest tested PDF (273 transactions): 2 seconds
• Bottleneck: Excel generation, not PDF parsing
⸻
Part 6: Results & Lessons Learned
I tested the parser on a dataset of 23 real Chase PDF statements:
• Business + Personal
• 2015–2025 (10 years)
• Formats: v1, v2, v3
• Total rows: 1,123 transactions
⸻
Accuracy
| Metric | Generic Tools | Custom Parser |
| --- | --- | --- |
| Correct Transactions | 802 | 1,112 |
| Format Detection | 33% | 100% |
| Headerless PDFs | Fail | Pass |
| Split Date Handling | Fail | Pass |
| Total Accuracy | ~71% | 99% |
⸻
What Worked
✔ Structure-based detection
✔ Heuristic column inference
✔ Split date merging
✔ Real-world testing (not synthetic PDFs)
✔ Using pdfjs-dist instead of OCR or regex-heavy hacks
⸻
What Didn’t Work
❌ Regex-only parsing
❌ Assuming headers always exist
❌ Fixed column positions
❌ Year-based format detection
❌ OCR — slow, inaccurate, unnecessary
⸻
Key Lessons Learned
- Test with real documents
Not all PDFs behave the same.
- Structure > content
Detect formats by text signatures, not by year.
- Use tolerance ranges, not precise numbers
Between PDFs, text shifts significantly.
- Merge text items aggressively
PDF.js fragments everything.
- Don’t try to “regex your way out”
Positional parsing beats text scrubbing every time.
⸻
Conclusion
Building a Chase Bank PDF parser taught me something unexpected:
PDFs are simple for humans to read and extremely complex for machines to parse.
Chase statements, in particular, combine:
• Multiple formats in the same year
• Missing headers
• Variable column alignment
• Fragmented text items
Generic converters assume too much structure.
To reach production-grade accuracy, you must infer structure dynamically.
The winning combination was:
• Structure-based format detection
• Heuristic column detection
• Split date merging
• pdfjs-dist + TypeScript
• Extensive testing on real PDFs
If you’re working with Chase PDFs and want to try a ready-made implementation, you can use https://bank-parser.com (free trial, no card required).
Have you built PDF parsers before?
What challenges did you face? I’d love to hear what approaches worked (or failed!) for you — share in the comments!