Baurzhan Zhetenov
How I Built a Chase Bank PDF Parser with 99% Accuracy

Parsing PDFs sounds easy until you try parsing bank statements.

I learned this the hard way.

I spent nearly 2 months building a Chase Bank PDF parser that reaches 99% accuracy across 23 real statements (1,123 transactions total). Meanwhile, generic converters like Tabula or PDFTables only hit ~70% on the same documents.

Here’s why Chase PDFs are much harder than you think—and how I solved the problems using TypeScript and pdfjs-dist, with real code you can copy.

Introduction

If you’ve ever worked with U.S. banking data, you know that Chase Bank does something strange:
They only let you download the last 18 months of transactions as CSV.

CPAs, bookkeepers, and backend engineers quickly hit a wall when they need 5+ years of historical data. Chase provides those older statements only as PDFs—and the PDFs are absolutely not designed for machine parsing.

Most accountants spend 45–60 minutes manually retyping each statement into QuickBooks or Excel.

Most developers try using generic PDF converters… and then discover that bank statements are in the top 1% of “PDFs that look structured but absolutely aren’t.”

I wanted to solve this in code.

In this article, you’ll learn:
• Why Chase PDFs are so uniquely hard to parse
• How structure-based format detection beats year-based detection
• How to infer column positions when the PDF has no headers
• How to merge split dates from fragmented PDF text items
• Real TypeScript code using pdfjs-dist
• Accuracy results from 23 real PDFs (2015–2025)

This is the article I wish existed before I started.

Part 1: Why Generic PDF Converters Fail on Bank Statements

After testing every major converter (PDFTables, Tabula, SmallPDF), I discovered four structural issues that make Chase PDFs uniquely problematic.

Let’s break them down.

Challenge 1: Multiple formats inside the SAME year

Chase used two formats simultaneously in 2024:
• v2 (2018–2024)
• v3 (2024–2025)

That means this detection method:

// ❌ WRONG: Year-based detection (breaks in 2024!)
function detectFormatWrong(year: number): 'v1' | 'v2' | 'v3' {
  if (year < 2018) return 'v1';
  if (year < 2024) return 'v2';
  return 'v3';
}

…works fine until you get a February 2024 statement in v2 format and a May 2024 statement in v3 format.

Generic converters assume document consistency.
Chase does not.

Challenge 2: Missing column headers

Some Chase PDFs—especially early-2022 Business Checking—contain no column labels at all.

Just raw rows:

02/01 AMAZON PAYMENT $1,250.00 $15,840.32

No:
• DATE
• DESCRIPTION
• AMOUNT
• BALANCE

Generic table extractors rely on headers. Without them, they completely collapse.
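
To make the problem concrete, here is roughly how that row comes back from the PDF text layer: just strings and X/Y positions, with nothing that says which column is which. (The values below are illustrative, not real Chase output.)

// Illustrative only: the approximate shape of positioned text items for a headerless row
const headerlessRow = [
  { str: '02/01',          transform: [0, 0, 0, 0,  42, 512] }, // date, far left
  { str: 'AMAZON PAYMENT', transform: [0, 0, 0, 0, 120, 512] }, // description
  { str: '$1,250.00',      transform: [0, 0, 0, 0, 380, 512] }, // amount
  { str: '$15,840.32',     transform: [0, 0, 0, 0, 470, 512] }, // balance
];
// No header row exists above these items; only their positions tell you what each column means.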

Challenge 3: Variable column positions

Typical fixed-width parsers assume:

DATE DESC AMOUNT BALANCE

But Chase PDFs vary:
• DATE X position: anywhere from 30 to 70 pixels
• AMOUNT column: sometimes 2nd from right, sometimes 3rd
• BALANCE column: right-aligned but with different indentation per statement
• DESCRIPTION: can shift 40–80 pixels depending on layout

You cannot rely on static pixel positions. You must infer structure dynamically.

Challenge 4: Split dates across text items

PDF.js may return:

"0"
"2"
"/01"

instead of one "02/01".

Why?

Because Chase stores each glyph separately in the PDF.
Generic converters treat these as separate columns and produce output like:

0, 2, /01, AMAZON, PAYMENT, $1250.00

When fixed:

02/01, AMAZON PAYMENT, $1250.00

Real example:

❌ Generic PDF converter:
Row 1: 0, 2, /01, AMAZON PAYMENT, $1,250.00, ???

✅ After merging + heuristics:
Row 1: 02/01, AMAZON PAYMENT, $1,250.00, $15,840.32

Accuracy comparison (23 real PDFs):

| Tool                       | Accuracy | Correct | Wrong |
| -------------------------- | -------- | ------- | ----- |
| Generic converters         | ~70%     | 802     | 321   |
| Custom parser (pdfjs + TS) | 99%      | 1,112   | 11    |
That’s 310 fewer errors—per 23 statements.

Part 2: Solution — Structure-Based Format Detection

The key insight:

Don’t detect PDF format by year. Detect it by TEXT SIGNATURES.

Chase formats have unique structural markers.
Once you read the full extracted text, you can reliably detect formats.

The 3 Chase formats

| Format | Years     | Columns | Structure           | Unique Signature                            |
| ------ | --------- | ------- | ------------------- | ------------------------------------------- |
| v1     | 2015–2017 | 3 cols  | Simple list         | No section headers                          |
| v2     | 2018–2024 | 4 cols  | Transaction table   | "TRANSACTION DETAIL"                        |
| v3     | 2024–2025 | 3 cols  | Grouped by category | "DEPOSITS AND ADDITIONS" + "TOTAL DEPOSITS" |

Year-based detection (WRONG)

// ❌ breaks immediately in 2024
function detectFormatWrong(year: number): 'v1' | 'v2' | 'v3' {
  if (year < 2018) return 'v1';
  if (year < 2024) return 'v2';
  return 'v3';
}

Structure-based detection (CORRECT)

// ✅ CORRECT: Structure-based detection
function detectChaseFormat(fullText: string): 'v1' | 'v2' | 'v3' {
  // Priority 1: Check for v2 signature
  if (fullText.includes('TRANSACTION DETAIL')) {
    return 'v2';
  }

  // Priority 2: Check for v3 signature
  if (fullText.includes('DEPOSITS AND ADDITIONS') &&
      fullText.includes('TOTAL DEPOSITS')) {
    return 'v3';
  }

  // Priority 3: Year-based fallback for old v1 format
  const year = extractStatementYear(fullText);
  if (year && year < 2018) {
    return 'v1';
  }

  // Default: assume v2
  return 'v2';
}
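
The extractStatementYear helper isn't shown above. Here's a minimal sketch, assuming the statement text contains at least one four-digit year (for example in the statement period line); adjust the pattern to your own statements:

// Hedged sketch: pull the first plausible four-digit year out of the extracted text.
// Assumes a year between 2000 and 2099 appears somewhere near the top of the statement.
function extractStatementYear(fullText: string): number | null {
  const match = fullText.match(/\b(20\d{2})\b/);
  return match ? parseInt(match[1], 10) : null;
}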

Why this works
• v2 always includes "TRANSACTION DETAIL"
• v3 always includes "DEPOSITS AND ADDITIONS" and "TOTAL DEPOSITS"
• v1 has none of these markers, so year fallback is safe
• Adding future formats becomes trivial: just add new signatures at the top of the list

Real-world validation

I tested all 23 PDFs:
• v1: 1 file
• v2: 15 files
• v3: 7 files

Detection accuracy: 23/23 (100%).

This approach also works for:
• Business Checking
• Personal Banking
• PDFs during format transition periods (e.g., April–July 2024)

Part 3: Heuristic Column Detection for PDFs with NO Headers

Some Chase PDFs simply omit headers altogether.
You must infer columns dynamically.

The solution:

Infer column positions from the first transaction row using date heuristics.

Step-by-step algorithm

  1. Identify the first transaction using a date pattern
    • Look for MM/DD (02/01)
    • In X range 30–70 (Chase always puts dates on the left)

  2. Extract all text items on the same horizontal row
    • Use a Y coordinate tolerance of ±5px

  3. Sort items left-to-right by X

  4. Infer column meaning:
    • leftmost → date
    • center → description
    • 2nd from right → amount
    • rightmost → balance

These rules held across every tested statement.

Code: Column detection

interface ColumnPositions {
  dateX: number;
  descX: number;
  amountX: number;
  balanceX: number;
}

function inferColumnPositions(
  textItems: PDFTextItem[]
): ColumnPositions {
  // Step 1: Find the first transaction row
  const firstDateItem = textItems.find(item =>
    /^\d{2}\/\d{2}$/.test(item.str) &&
    item.transform[4] >= 30 &&
    item.transform[4] <= 70
  );

  if (!firstDateItem) {
    throw new Error('Cannot find first transaction (no date pattern found)');
  }

  // Step 2: Extract row by Y position
  const dateY = firstDateItem.transform[5];
  const rowItems = textItems.filter(item =>
    Math.abs(item.transform[5] - dateY) < 5
  );

  // Step 3: Sort left-to-right
  const sortedByX = rowItems.sort((a, b) =>
    a.transform[4] - b.transform[4]
  );

  // Step 4: Infer from positions
  return {
    dateX: sortedByX[0].transform[4],
    descX: (sortedByX[0].transform[4] +
            sortedByX[sortedByX.length - 1].transform[4]) / 2,
    balanceX: sortedByX[sortedByX.length - 1].transform[4],
    amountX: sortedByX[sortedByX.length - 2].transform[4]
  };
}
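
The PDFTextItem type mirrors what pdfjs-dist hands back from getTextContent(): a string plus a 6-number transform whose last two entries are X and Y. Here's a minimal sketch of pulling positioned items off every page; treat the import path as an assumption, since it varies between pdfjs-dist versions:

// Minimal sketch: extract positioned text items from every page of a PDF.
// The legacy build path below is an assumption and differs between pdfjs-dist versions.
import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs';

interface PDFTextItem {
  str: string;
  transform: number[]; // [a, b, c, d, x, y], so x = transform[4], y = transform[5]
}

async function extractTextItems(data: Uint8Array): Promise<PDFTextItem[]> {
  const doc = await getDocument({ data }).promise;
  const items: PDFTextItem[] = [];

  for (let pageNum = 1; pageNum <= doc.numPages; pageNum++) {
    const page = await doc.getPage(pageNum);
    const content = await page.getTextContent();
    for (const item of content.items) {
      // Skip marked-content entries, which carry no text
      if ('str' in item) {
        items.push({ str: item.str, transform: item.transform });
      }
    }
  }

  return items;
}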

Why this works
• Chase PDFs ALWAYS have date on the far left
• Balance is ALWAYS right-aligned
• Description always occupies the middle
• Amount is consistently next to balance

This works even with:
• v1 (3 columns)
• v2 (4 columns)
• v3 (3 columns + grouped sections)

Part 4: Handling Split Dates

pdfjs-dist often splits glyphs into separate items.

Example raw output:

"0"
"2"
"/"
"0"
"1"

You must merge items by proximity.

Core idea:

If two items’ X positions differ by less than 15px, they’re part of the same text value.

This was empirically tested across 23 PDFs.

Code: Merging split date fragments

function mergeSplitDates(items: PDFTextItem[]): PDFTextItem[] {
  const merged: PDFTextItem[] = [];
  let buffer = '';
  let bufferX = 0;

  for (let i = 0; i < items.length; i++) {
    const item = items[i];
    const nextItem = items[i + 1];

    // Merge if close enough
    if (nextItem &&
        Math.abs(nextItem.transform[4] - item.transform[4]) < 15) {
      buffer += item.str;
      if (!bufferX) bufferX = item.transform[4];
    } else {
      merged.push({
        str: buffer + item.str,
        transform: [0, 0, 0, 0, bufferX || item.transform[4], item.transform[5]]
      });
      buffer = '';
      bufferX = 0;
    }
  }

  return merged;
}
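
A quick sanity check (the items below are hypothetical fragments, not real pdfjs output):

// Hypothetical fragments: X positions a few units apart, same Y
const rawItems: PDFTextItem[] = [
  { str: '0',   transform: [0, 0, 0, 0, 42, 512] },
  { str: '2',   transform: [0, 0, 0, 0, 48, 512] },
  { str: '/01', transform: [0, 0, 0, 0, 54, 512] },
];

console.log(mergeSplitDates(rawItems).map(i => i.str));
// → ['02/01']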

Why 15px?
• < 10px missed some merges
• 20px caused accidental merges
• 15px was perfect across all documents

Result

❌ Before: ["0", "2", "/01", "AMAZON", "PAY", "MENT"]
✅ After: ["02/01", "AMAZON PAYMENT"]

You absolutely cannot build a reliable parser without this.

Part 5: Tech Stack & Architecture

Here’s the stack that worked reliably.

Core technologies

pdfjs-dist
• Same engine Firefox uses
• Extracts precise text positions (X/Y)
• Supports PDF 1.4–2.0
• Lightweight compared to OCR (no 200MB Tesseract install)

TypeScript
• Needed for complex PDF item types
• Prevents 90% of runtime errors
• Great autocomplete for pdfjs API

Node.js
• Fast enough for server-side parsing
• Can run heavy parsing without blocking UI

Bull + Redis
• Parallel PDF processing
• Retry logic
• Failure handling that generic HTTP handlers lack

ExcelJS
• Generates QuickBooks-ready Excel output
• Supports proper data validation + number formats
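
A minimal sketch of the Excel output step; the column layout and number formats are simplified compared to the real exporter:

import ExcelJS from 'exceljs';

// Hedged sketch: write normalized transactions to an .xlsx file
async function writeTransactionsToExcel(
  transactions: { date: string; description: string; amount: number; balance: number }[],
  outputPath: string
): Promise<void> {
  const workbook = new ExcelJS.Workbook();
  const sheet = workbook.addWorksheet('Transactions');

  sheet.columns = [
    { header: 'Date', key: 'date', width: 12 },
    { header: 'Description', key: 'description', width: 40 },
    { header: 'Amount', key: 'amount', width: 14, style: { numFmt: '#,##0.00' } },
    { header: 'Balance', key: 'balance', width: 14, style: { numFmt: '#,##0.00' } },
  ];

  transactions.forEach(tx => sheet.addRow(tx));
  await workbook.xlsx.writeFile(outputPath);
}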

System Architecture Flow

  1. User uploads PDF
  2. Backend creates Bull job
  3. Worker parses PDF with pdfjs-dist
  4. Detect format (v1/v2/v3)
  5. Merge split dates
  6. Infer column positions
  7. Extract rows into normalized structure
  8. Generate final Excel file (ExcelJS)
  9. Return download URL
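
A condensed sketch of how the worker wires those steps together. The queue name, Redis URL, job payload shape, and the parseTransactions helper are placeholders, not the production code:

import { readFile } from 'fs/promises';
import Queue from 'bull';

// Hedged sketch: queue name, Redis URL, and payload shape are illustrative
const pdfQueue = new Queue('chase-pdf-parsing', 'redis://127.0.0.1:6379');

pdfQueue.process(async job => {
  const { pdfPath, outputPath } = job.data;

  const data = new Uint8Array(await readFile(pdfPath));
  const textItems = await extractTextItems(data);                   // sketch from Part 3
  const merged = mergeSplitDates(textItems);                        // Part 4
  const fullText = merged.map(i => i.str).join(' ');

  const format = detectChaseFormat(fullText);                       // Part 2
  const columns = inferColumnPositions(merged);                     // Part 3
  const transactions = parseTransactions(merged, format, columns);  // hypothetical extraction step

  await writeTransactionsToExcel(transactions, outputPath);
  return { outputPath, count: transactions.length };
});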

Performance
• Average PDF: 5 seconds
• Largest tested PDF (273 transactions): 2 seconds
• Bottleneck: Excel generation, not PDF parsing

Part 6: Results & Lessons Learned

I tested the parser on a dataset of 23 real Chase PDF statements:
• Business + Personal
• 2015–2025 (10 years)
• Formats: v1, v2, v3
• Total rows: 1,123 transactions

Accuracy

| Metric               | Generic Tools | Custom Parser |
| -------------------- | ------------- | ------------- |
| Correct Transactions | 802           | 1,112         |
| Format Detection     | 33%           | 100%          |
| Headerless PDFs      | Fail          | Pass          |
| Split Date Handling  | Fail          | Pass          |
| Total Accuracy       | ~71%          | 99%           |

What Worked

✔ Structure-based detection
✔ Heuristic column inference
✔ Split date merging
✔ Real-world testing (not synthetic PDFs)
✔ Using pdfjs-dist instead of OCR or regex-heavy hacks

What Didn’t Work

❌ Regex-only parsing
❌ Assuming headers always exist
❌ Fixed column positions
❌ Year-based format detection
❌ OCR — slow, inaccurate, unnecessary

Key Lessons Learned

  1. Test with real documents

Not all PDFs behave the same.

  2. Structure > content

Detect formats by text signatures, not by year.

  3. Use tolerance ranges, not precise numbers

Between PDFs, text shifts significantly.

  4. Merge text items aggressively

PDF.js fragments everything.

  5. Don’t try to “regex your way out”

Positional parsing beats text scrubbing every time.

Conclusion

Building a Chase Bank PDF parser taught me something unexpected:

PDFs are simple for humans to read and extremely complex for machines to parse.

Chase statements, in particular, combine:
• Multiple formats in the same year
• Missing headers
• Variable column alignment
• Fragmented text items

Generic converters assume too much structure.
To reach production-grade accuracy, you must infer structure dynamically.

The winning combination was:
• Structure-based format detection
• Heuristic column detection
• Split date merging
• pdfjs-dist + TypeScript
• Extensive testing on real PDFs

If you’re working with Chase PDFs and want to try a ready-made implementation, you can use https://bank-parser.com (free trial, no card required).

Have you built PDF parsers before?
What challenges did you face? I’d love to hear what approaches worked (or failed!) for you — share in the comments!
