<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Baurzhan Zhetenov</title>
    <description>The latest articles on DEV Community by Baurzhan Zhetenov (@baurzhan_zhetenov_442c4cd).</description>
    <link>https://dev.to/baurzhan_zhetenov_442c4cd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3044341%2F4cbe68a9-fc09-43cc-a2ca-07e94d8a4e5f.jpg</url>
      <title>DEV Community: Baurzhan Zhetenov</title>
      <link>https://dev.to/baurzhan_zhetenov_442c4cd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/baurzhan_zhetenov_442c4cd"/>
    <language>en</language>
    <item>
      <title>How I Built a Chase Bank PDF Parser with 99% Accuracy</title>
      <dc:creator>Baurzhan Zhetenov</dc:creator>
      <pubDate>Thu, 20 Nov 2025 03:36:48 +0000</pubDate>
      <link>https://dev.to/baurzhan_zhetenov_442c4cd/how-i-built-a-chase-bank-pdf-parser-with-99-accuracy-4j6c</link>
      <guid>https://dev.to/baurzhan_zhetenov_442c4cd/how-i-built-a-chase-bank-pdf-parser-with-99-accuracy-4j6c</guid>
      <description>&lt;p&gt;Parsing PDFs sounds easy until you try parsing bank statements.&lt;/p&gt;

&lt;p&gt;I learned this the hard way.&lt;/p&gt;

&lt;p&gt;I spent nearly 2 months building a Chase Bank PDF parser that reaches 99% accuracy across 23 real statements (1,123 transactions total). Meanwhile, generic converters like Tabula or PDFTables only hit ~71% on the same documents.&lt;/p&gt;

&lt;p&gt;Here’s why Chase PDFs are much harder than you think—and how I solved the problems using TypeScript and pdfjs-dist, with real code you can copy.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Introduction&lt;/p&gt;

&lt;p&gt;If you’ve ever worked with U.S. banking data, you know that Chase Bank does something strange:&lt;br&gt;
They only let you download the last 18 months of transactions as CSV.&lt;/p&gt;

&lt;p&gt;CPAs, bookkeepers, and backend engineers quickly hit a wall when they need 5+ years of historical data. Chase provides those older statements only as PDFs—and the PDFs are absolutely not designed for machine parsing.&lt;/p&gt;

&lt;p&gt;Most accountants spend 45–60 minutes manually retyping each statement into QuickBooks or Excel.&lt;/p&gt;

&lt;p&gt;Most developers try using generic PDF converters… and then discover that bank statements are in the top 1% of “PDFs that look structured but absolutely aren’t.”&lt;/p&gt;

&lt;p&gt;I wanted to solve this in code.&lt;/p&gt;

&lt;p&gt;In this article, you’ll learn:&lt;br&gt;
    • Why Chase PDFs are so uniquely hard to parse&lt;br&gt;
    • How structure-based format detection beats year-based detection&lt;br&gt;
    • How to infer column positions when the PDF has no headers&lt;br&gt;
    • How to merge split dates from fragmented PDF text items&lt;br&gt;
    • Real TypeScript code using pdfjs-dist&lt;br&gt;
    • Accuracy results from 23 real PDFs (2015–2025)&lt;/p&gt;

&lt;p&gt;This is the article I wish existed before I started.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Part 1: Why Generic PDF Converters Fail on Bank Statements&lt;/p&gt;

&lt;p&gt;After testing every major converter (PDFTables, Tabula, SmallPDF), I discovered four structural issues that make Chase PDFs uniquely problematic.&lt;/p&gt;

&lt;p&gt;Let’s break them down.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Challenge 1: Multiple formats inside the SAME year&lt;/p&gt;

&lt;p&gt;Chase used two formats simultaneously in 2024:&lt;br&gt;
    • v2 (2018–2024)&lt;br&gt;
    • v3 (2024–2025)&lt;/p&gt;

&lt;p&gt;That means this detection method:&lt;/p&gt;

&lt;p&gt;// ❌ WRONG: Year-based detection (breaks in 2024!)&lt;br&gt;
function detectFormatWrong(year: number): 'v1' | 'v2' | 'v3' {&lt;br&gt;
  if (year &amp;lt; 2018) return 'v1';&lt;br&gt;
  if (year &amp;lt; 2024) return 'v2';&lt;br&gt;
  return 'v3';&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;…works fine until you get a February 2024 statement in v2 format and a May 2024 statement in v3 format.&lt;/p&gt;

&lt;p&gt;Generic converters assume document consistency.&lt;br&gt;
Chase does not.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Challenge 2: Missing column headers&lt;/p&gt;

&lt;p&gt;Some Chase PDFs—especially early-2022 Business Checking—contain no column labels at all.&lt;/p&gt;

&lt;p&gt;Just raw rows:&lt;/p&gt;

&lt;p&gt;02/01    AMAZON PAYMENT     $1,250.00     $15,840.32&lt;/p&gt;

&lt;p&gt;No:&lt;br&gt;
    • DATE&lt;br&gt;
    • DESCRIPTION&lt;br&gt;
    • AMOUNT&lt;br&gt;
    • BALANCE&lt;/p&gt;

&lt;p&gt;Generic table extractors rely on headers. Without them, they completely collapse.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Challenge 3: Variable column positions&lt;/p&gt;

&lt;p&gt;Typical fixed-width parsers assume:&lt;/p&gt;

&lt;p&gt;DATE      DESC      AMOUNT      BALANCE&lt;/p&gt;

&lt;p&gt;But Chase PDFs vary:&lt;br&gt;
    • DATE X position: anywhere from 30 to 70 px (PDF user-space units as reported by pdfjs-dist, not screen pixels)&lt;br&gt;
    • AMOUNT column: sometimes 2nd from right, sometimes 3rd&lt;br&gt;
    • BALANCE column: right-aligned but with different indentation per statement&lt;br&gt;
    • DESCRIPTION: can shift 40–80 pixels depending on layout&lt;/p&gt;

&lt;p&gt;You cannot rely on static pixel positions. You must infer structure dynamically.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Challenge 4: Split dates across text items&lt;/p&gt;

&lt;p&gt;PDF.js may return:&lt;/p&gt;

&lt;p&gt;"0"&lt;br&gt;
"2"&lt;br&gt;
"/01"&lt;/p&gt;

&lt;p&gt;instead of one "02/01".&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because Chase stores each glyph separately in the PDF.&lt;br&gt;
Generic converters treat these as separate columns and produce output like:&lt;/p&gt;

&lt;p&gt;0, 2, /01, AMAZON, PAYMENT, $1,250.00&lt;/p&gt;

&lt;p&gt;When fixed:&lt;/p&gt;

&lt;p&gt;02/01, AMAZON PAYMENT, $1,250.00&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Real example:&lt;/p&gt;

&lt;p&gt;❌ Generic PDF converter:&lt;br&gt;
Row 1: 0, 2, /01, AMAZON PAYMENT, $1,250.00, ???&lt;/p&gt;

&lt;p&gt;✅ After merging + heuristics:&lt;br&gt;
Row 1: 02/01, AMAZON PAYMENT, $1,250.00, $15,840.32&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Accuracy comparison (23 real PDFs):&lt;/p&gt;

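&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Tool&lt;/th&gt;&lt;th&gt;Accuracy&lt;/th&gt;&lt;th&gt;Correct&lt;/th&gt;&lt;th&gt;Wrong&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Generic converters&lt;/td&gt;&lt;td&gt;~71%&lt;/td&gt;&lt;td&gt;802&lt;/td&gt;&lt;td&gt;321&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Custom parser (pdfjs + TS)&lt;/td&gt;&lt;td&gt;99%&lt;/td&gt;&lt;td&gt;1,112&lt;/td&gt;&lt;td&gt;11&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;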

&lt;p&gt;That’s 310 fewer errors across the same 23 statements.&lt;/p&gt;
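
&lt;p&gt;As a quick check on the figures above, the percentages follow directly from the correct and total counts. The helper name accuracyPercent is only illustrative; the counts are the ones reported in the comparison.&lt;/p&gt;

```typescript
// Illustrative helper (not part of the parser): accuracy as a percentage.
function accuracyPercent(correct: number, total: number): number {
  if (total === 0) {
    throw new Error("total must be greater than zero");
  }
  return (correct / total) * 100;
}

// Counts reported in the comparison: 1,123 transactions in total.
const totalTransactions = 1123;
const genericAccuracy = accuracyPercent(802, totalTransactions);
const customAccuracy = accuracyPercent(1112, totalTransactions);

console.log(genericAccuracy.toFixed(1)); // 71.4
console.log(customAccuracy.toFixed(1)); // 99.0
console.log(321 - 11); // 310 fewer wrong rows
```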

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Part 2: Solution — Structure-Based Format Detection&lt;/p&gt;

&lt;p&gt;The key insight:&lt;/p&gt;

&lt;p&gt;Don’t detect PDF format by year. Detect it by TEXT SIGNATURES.&lt;/p&gt;

&lt;p&gt;Chase formats have unique structural markers.&lt;br&gt;
Once you read the full extracted text, you can reliably detect formats.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The 3 Chase formats&lt;/p&gt;

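&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Format&lt;/th&gt;&lt;th&gt;Years&lt;/th&gt;&lt;th&gt;Columns&lt;/th&gt;&lt;th&gt;Structure&lt;/th&gt;&lt;th&gt;Unique Signature&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;v1&lt;/td&gt;&lt;td&gt;2015–2017&lt;/td&gt;&lt;td&gt;3 cols&lt;/td&gt;&lt;td&gt;Simple list&lt;/td&gt;&lt;td&gt;No section headers&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;v2&lt;/td&gt;&lt;td&gt;2018–2024&lt;/td&gt;&lt;td&gt;4 cols&lt;/td&gt;&lt;td&gt;Transaction table&lt;/td&gt;&lt;td&gt;"TRANSACTION DETAIL"&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;v3&lt;/td&gt;&lt;td&gt;2024–2025&lt;/td&gt;&lt;td&gt;3 cols&lt;/td&gt;&lt;td&gt;Grouped by category&lt;/td&gt;&lt;td&gt;"DEPOSITS AND ADDITIONS" + "TOTAL DEPOSITS"&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;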

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Year-based detection (WRONG)&lt;/p&gt;

&lt;p&gt;// ❌ breaks immediately in 2024&lt;br&gt;
function detectFormatWrong(year: number): 'v1' | 'v2' | 'v3' {&lt;br&gt;
  if (year &amp;lt; 2018) return 'v1';&lt;br&gt;
  if (year &amp;lt; 2024) return 'v2';&lt;br&gt;
  return 'v3';&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Structure-based detection (CORRECT)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ✅ CORRECT: Structure-based detection
function detectChaseFormat(fullText: string): 'v1' | 'v2' | 'v3' {
  // Priority 1: Check for v2 signature
  if (fullText.includes('TRANSACTION DETAIL')) {
    return 'v2';
  }

  // Priority 2: Check for v3 signature
  if (fullText.includes('DEPOSITS AND ADDITIONS') &amp;amp;&amp;amp;
      fullText.includes('TOTAL DEPOSITS')) {
    return 'v3';
  }

  // Priority 3: Year-based fallback for old v1 format.
  // extractStatementYear is a helper not shown here; it is assumed to
  // return the statement year, or a falsy value when none is found.
  const year = extractStatementYear(fullText);
  if (year &amp;amp;&amp;amp; year &amp;lt; 2018) {
    return 'v1';
  }

  // Default: assume v2
  return 'v2';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why this works&lt;br&gt;
    • v2 always includes "TRANSACTION DETAIL"&lt;br&gt;
    • v3 always includes "DEPOSITS AND ADDITIONS" and "TOTAL DEPOSITS"&lt;br&gt;
    • v1 has none of these markers, so year fallback is safe&lt;br&gt;
    • Adding future formats becomes trivial: just add new signatures at the top of the list&lt;/p&gt;
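
&lt;p&gt;One way to keep that list of signatures easy to extend is to hold it as data rather than as a chain of if statements. The sketch below is an assumption about how this could look, not the parser's actual code: the names FormatSignature, SIGNATURES, and detectBySignature are hypothetical, and only the marker strings come from the article.&lt;/p&gt;

```typescript
type ChaseFormat = "v1" | "v2" | "v3";

// Hypothetical shape: a format plus the markers that identify it.
interface FormatSignature {
  format: ChaseFormat;
  // Every marker must appear in the extracted text for the format to match.
  markers: string[];
}

// Order matters: earlier entries win, so new formats go at the top.
const SIGNATURES: FormatSignature[] = [
  { format: "v2", markers: ["TRANSACTION DETAIL"] },
  { format: "v3", markers: ["DEPOSITS AND ADDITIONS", "TOTAL DEPOSITS"] },
];

function detectBySignature(fullText: string): ChaseFormat | null {
  for (const signature of SIGNATURES) {
    if (signature.markers.every((marker) => fullText.includes(marker))) {
      return signature.format;
    }
  }
  // No signature matched; the caller decides what to do next.
  return null;
}

console.log(detectBySignature("Page 1 TRANSACTION DETAIL 02/01")); // v2
```

&lt;p&gt;A null result means no signature matched, which is where the year-based fallback for v1 would take over.&lt;/p&gt;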

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Real-world validation&lt;/p&gt;

&lt;p&gt;I tested all 23 PDFs:&lt;br&gt;
    • v1: 1 file&lt;br&gt;
    • v2: 15 files&lt;br&gt;
    • v3: 7 files&lt;/p&gt;

&lt;p&gt;Detection accuracy: 23/23 (100%).&lt;/p&gt;

&lt;p&gt;This approach also works for:&lt;br&gt;
    • Business Checking&lt;br&gt;
    • Personal Banking&lt;br&gt;
    • PDFs during format transition periods (e.g., April–July 2024)&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Part 3: Heuristic Column Detection for PDFs with NO Headers&lt;/p&gt;

&lt;p&gt;Some Chase PDFs simply omit headers altogether.&lt;br&gt;
You must infer columns dynamically.&lt;/p&gt;

&lt;p&gt;The solution:&lt;/p&gt;

&lt;p&gt;Infer column positions from the first transaction row using date heuristics.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Step-by-step algorithm&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Identify the first transaction using a date pattern&lt;br&gt;
• Look for MM/DD (02/01)&lt;br&gt;
• In X range 30–70 (Chase always puts dates on the left)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extract all text items on the same horizontal row&lt;br&gt;
• Use a Y coordinate tolerance of ±5px&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sort items left-to-right by X&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infer column meaning:&lt;br&gt;
• leftmost → date&lt;br&gt;
• center → description&lt;br&gt;
• 2nd from right → amount&lt;br&gt;
• rightmost → balance&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These rules held across every tested statement.&lt;/p&gt;
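
&lt;p&gt;The row-extraction step generalizes beyond the first transaction: every item on a page can be bucketed into rows with the same ±5 tolerance. This is a minimal sketch under assumptions, not the article's implementation: PositionedText and groupIntoRows are hypothetical names, x and y stand in for transform[4] and transform[5], and the sample coordinates are made up.&lt;/p&gt;

```typescript
// Hypothetical simplified item: text plus its position on the page.
interface PositionedText {
  str: string;
  x: number; // stands in for transform[4]
  y: number; // stands in for transform[5]
}

function groupIntoRows(items: PositionedText[], tolerance = 5): PositionedText[][] {
  // PDF Y typically grows upward, so sort descending to walk top to bottom.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows: PositionedText[][] = [];
  let rowY = 0;

  for (const item of sorted) {
    const current = rows[rows.length - 1];
    if (current === undefined || Math.abs(item.y - rowY) >= tolerance) {
      rows.push([item]);
      rowY = item.y; // anchor the new row on its first item
    } else {
      current.push(item);
    }
  }

  // Left-to-right order inside each row.
  return rows.map((row) => row.sort((a, b) => a.x - b.x));
}

const sampleRows = groupIntoRows([
  { str: "$1,250.00", x: 430, y: 600.2 },
  { str: "02/01", x: 42, y: 601 },
  { str: "AMAZON PAYMENT", x: 110, y: 600.5 },
  { str: "02/03", x: 42, y: 585 },
]);
// First row: 02/01 | AMAZON PAYMENT | $1,250.00, second row: 02/03
console.log(sampleRows.map((row) => row.map((item) => item.str).join(" | ")));
```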

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Code: Column detection&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Minimal shape of a pdfjs-dist text item as used in this article
interface PDFTextItem {
  str: string;
  transform: number[]; // transform[4] = X, transform[5] = Y
}

interface ColumnPositions {
  dateX: number;
  descX: number;
  amountX: number;
  balanceX: number;
}

function inferColumnPositions(
  textItems: PDFTextItem[]
): ColumnPositions {
  // Step 1: Find the first transaction row
  const firstDateItem = textItems.find(item =&amp;gt;
    /^\d{2}\/\d{2}$/.test(item.str) &amp;amp;&amp;amp;
    item.transform[4] &amp;gt;= 30 &amp;amp;&amp;amp;
    item.transform[4] &amp;lt;= 70
  );

  if (!firstDateItem) {
    throw new Error('Cannot find first transaction (no date pattern found)');
  }

  // Step 2: Extract row by Y position
  const dateY = firstDateItem.transform[5];
  const rowItems = textItems.filter(item =&amp;gt;
    Math.abs(item.transform[5] - dateY) &amp;lt; 5
  );

  // Step 3: Sort left-to-right
  const sortedByX = rowItems.sort((a, b) =&amp;gt;
    a.transform[4] - b.transform[4]
  );

  // Guard: indexing length - 2 below needs at least two items on the row
  if (sortedByX.length &amp;lt; 2) {
    throw new Error('First transaction row has too few items to infer columns');
  }

  // Step 4: Infer from positions
  return {
    dateX: sortedByX[0].transform[4],
    descX: (sortedByX[0].transform[4] +
            sortedByX[sortedByX.length - 1].transform[4]) / 2,
    balanceX: sortedByX[sortedByX.length - 1].transform[4],
    amountX: sortedByX[sortedByX.length - 2].transform[4]
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why this works&lt;br&gt;
    • Chase PDFs ALWAYS have date on the far left&lt;br&gt;
    • Balance is ALWAYS right-aligned&lt;br&gt;
    • Description always occupies the middle&lt;br&gt;
    • Amount is consistently next to balance&lt;/p&gt;

&lt;p&gt;This works even with:&lt;br&gt;
    • v1 (3 columns)&lt;br&gt;
    • v2 (4 columns)&lt;br&gt;
    • v3 (3 columns + grouped sections)&lt;/p&gt;
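
&lt;p&gt;Once the column positions are known, each item on later rows has to be mapped to a column. A simple option, sketched here as an assumption rather than the parser's actual logic, is to pick the column whose inferred X is closest. nearestColumn and ColumnName are hypothetical names, and the sample coordinates are made up for illustration.&lt;/p&gt;

```typescript
interface ColumnPositions {
  dateX: number;
  descX: number;
  amountX: number;
  balanceX: number;
}

type ColumnName = "date" | "description" | "amount" | "balance";

// Hypothetical helper: map an item's X to the closest inferred column.
function nearestColumn(x: number, columns: ColumnPositions): ColumnName {
  const candidates: [ColumnName, number][] = [
    ["date", columns.dateX],
    ["description", columns.descX],
    ["amount", columns.amountX],
    ["balance", columns.balanceX],
  ];
  // Keep whichever candidate has the smaller horizontal distance to x.
  const closest = candidates.reduce((best, candidate) =>
    Math.abs(best[1] - x) > Math.abs(candidate[1] - x) ? candidate : best
  );
  return closest[0];
}

// Made-up positions, only to show the call.
const exampleColumns: ColumnPositions = {
  dateX: 42,
  descX: 110,
  amountX: 430,
  balanceX: 520,
};
console.log(nearestColumn(45, exampleColumns)); // date
console.log(nearestColumn(436, exampleColumns)); // amount
```

&lt;p&gt;Because the X of a text item marks where the text starts, right-aligned amounts of different widths begin at different positions, so a real implementation may need to compare right edges instead.&lt;/p&gt;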

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Part 4: Handling Split Dates&lt;/p&gt;

&lt;p&gt;pdfjs-dist often splits glyphs into separate items.&lt;/p&gt;

&lt;p&gt;Example raw output:&lt;/p&gt;

&lt;p&gt;"0"&lt;br&gt;
"2"&lt;br&gt;
"/"&lt;br&gt;
"0"&lt;br&gt;
"1"&lt;/p&gt;

&lt;p&gt;You must merge items by proximity.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Core idea:&lt;/p&gt;

&lt;p&gt;If the X positions of two adjacent items differ by less than 15px, they’re part of the same text value.&lt;/p&gt;

&lt;p&gt;This was empirically tested across 23 PDFs.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Code: Merging split date fragments&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function mergeSplitDates(items: PDFTextItem[]): PDFTextItem[] {
  const merged: PDFTextItem[] = [];
  let buffer = '';
  let bufferX: number | null = null; // X of the first fragment in the run

  for (let i = 0; i &amp;lt; items.length; i++) {
    const item = items[i];
    const nextItem = items[i + 1];

    // Merge if close enough horizontally and on the same row (±5px in Y)
    if (nextItem &amp;amp;&amp;amp;
        Math.abs(nextItem.transform[4] - item.transform[4]) &amp;lt; 15 &amp;amp;&amp;amp;
        Math.abs(nextItem.transform[5] - item.transform[5]) &amp;lt; 5) {
      buffer += item.str;
      if (bufferX === null) bufferX = item.transform[4];
    } else {
      // Flush the run, positioned at the X of its first fragment
      merged.push({
        str: buffer + item.str,
        transform: [0, 0, 0, 0, bufferX ?? item.transform[4], item.transform[5]]
      });
      buffer = '';
      bufferX = null;
    }
  }

  return merged;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why 15px?&lt;br&gt;
    • &amp;lt; 10px missed some merges&lt;br&gt;
    • 20px caused accidental merges&lt;br&gt;
    • 15px was perfect across all documents&lt;/p&gt;

&lt;p&gt;Result&lt;/p&gt;

&lt;p&gt;❌ Before: ["0", "2", "/01", "AMAZON", "PAY", "MENT"]&lt;br&gt;
✅ After:  ["02/01", "AMAZON PAYMENT"]&lt;/p&gt;

&lt;p&gt;You absolutely cannot build a reliable parser without this.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Part 5: Tech Stack &amp;amp; Architecture&lt;/p&gt;

&lt;p&gt;Here’s the stack that worked reliably.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Core technologies&lt;/p&gt;

&lt;p&gt;pdfjs-dist&lt;br&gt;
    • Same engine Firefox uses&lt;br&gt;
    • Extracts precise text positions (X/Y)&lt;br&gt;
    • Supports PDF 1.4–2.0&lt;br&gt;
    • Lightweight compared to OCR (no 200MB Tesseract install)&lt;/p&gt;

&lt;p&gt;TypeScript&lt;br&gt;
    • Needed for complex PDF item types&lt;br&gt;
    • Catches many would-be runtime errors at compile time&lt;br&gt;
    • Great autocomplete for pdfjs API&lt;/p&gt;

&lt;p&gt;Node.js&lt;br&gt;
    • Fast enough for server-side parsing&lt;br&gt;
    • Can run heavy parsing without blocking UI&lt;/p&gt;

&lt;p&gt;Bull + Redis&lt;br&gt;
    • Parallel PDF processing&lt;br&gt;
    • Retry logic&lt;br&gt;
    • Failure handling that generic HTTP handlers lack&lt;/p&gt;

&lt;p&gt;ExcelJS&lt;br&gt;
    • Generates QuickBooks-ready Excel output&lt;br&gt;
    • Supports proper data validation + number formats&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;System Architecture Flow&lt;/p&gt;

&lt;p&gt;User uploads PDF&lt;br&gt;
        ↓&lt;br&gt;
Backend creates Bull job&lt;br&gt;
        ↓&lt;br&gt;
Worker parses PDF with pdfjs-dist&lt;br&gt;
        ↓&lt;br&gt;
Detect format (v1/v2/v3)&lt;br&gt;
        ↓&lt;br&gt;
Merge split dates&lt;br&gt;
        ↓&lt;br&gt;
Infer column positions&lt;br&gt;
        ↓&lt;br&gt;
Extract rows into normalized structure&lt;br&gt;
        ↓&lt;br&gt;
Generate final Excel file (ExcelJS)&lt;br&gt;
        ↓&lt;br&gt;
Return download URL&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Performance&lt;br&gt;
    • Average PDF: 5 seconds&lt;br&gt;
    • Largest tested PDF (273 transactions): 2 seconds&lt;br&gt;
    • Bottleneck: Excel generation, not PDF parsing&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Part 6: Results &amp;amp; Lessons Learned&lt;/p&gt;

&lt;p&gt;I tested the parser on a dataset of 23 real Chase PDF statements:&lt;br&gt;
    • Business + Personal&lt;br&gt;
    • 2015–2025 (10 years)&lt;br&gt;
    • Formats: v1, v2, v3&lt;br&gt;
    • Total rows: 1,123 transactions&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Accuracy&lt;/p&gt;

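&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Generic Tools&lt;/th&gt;&lt;th&gt;Custom Parser&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Correct Transactions&lt;/td&gt;&lt;td&gt;802&lt;/td&gt;&lt;td&gt;1,112&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Format Detection&lt;/td&gt;&lt;td&gt;33%&lt;/td&gt;&lt;td&gt;100%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Headerless PDFs&lt;/td&gt;&lt;td&gt;Fail&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Split Date Handling&lt;/td&gt;&lt;td&gt;Fail&lt;/td&gt;&lt;td&gt;Pass&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Total Accuracy&lt;/td&gt;&lt;td&gt;~71%&lt;/td&gt;&lt;td&gt;99%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;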

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What Worked&lt;/p&gt;

&lt;p&gt;✔ Structure-based detection&lt;br&gt;
✔ Heuristic column inference&lt;br&gt;
✔ Split date merging&lt;br&gt;
✔ Real-world testing (not synthetic PDFs)&lt;br&gt;
✔ Using pdfjs-dist instead of OCR or regex-heavy hacks&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What Didn’t Work&lt;/p&gt;

&lt;p&gt;❌ Regex-only parsing&lt;br&gt;
❌ Assuming headers always exist&lt;br&gt;
❌ Fixed column positions&lt;br&gt;
❌ Year-based format detection&lt;br&gt;
❌ OCR — slow, inaccurate, unnecessary&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Key Lessons Learned&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Test with real documents&lt;br&gt;
Not all PDFs behave the same.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Structure &amp;gt; content&lt;br&gt;
Detect formats by text signatures, not by year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use tolerance ranges, not precise numbers&lt;br&gt;
Between PDFs, text shifts significantly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Merge text items aggressively&lt;br&gt;
PDF.js fragments everything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don’t try to “regex your way out”&lt;br&gt;
Positional parsing beats text scrubbing every time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Conclusion&lt;/p&gt;

&lt;p&gt;Building a Chase Bank PDF parser taught me something unexpected:&lt;/p&gt;

&lt;p&gt;PDFs are simple for humans to read and extremely complex for machines to parse.&lt;/p&gt;

&lt;p&gt;Chase statements, in particular, combine:&lt;br&gt;
    • Multiple formats in the same year&lt;br&gt;
    • Missing headers&lt;br&gt;
    • Variable column alignment&lt;br&gt;
    • Fragmented text items&lt;/p&gt;

&lt;p&gt;Generic converters assume too much structure.&lt;br&gt;
To reach production-grade accuracy, you must infer structure dynamically.&lt;/p&gt;

&lt;p&gt;The winning combination was:&lt;br&gt;
    • Structure-based format detection&lt;br&gt;
    • Heuristic column detection&lt;br&gt;
    • Split date merging&lt;br&gt;
    • pdfjs-dist + TypeScript&lt;br&gt;
    • Extensive testing on real PDFs&lt;/p&gt;

&lt;p&gt;If you’re working with Chase PDFs and want to try a ready-made implementation, you can use &lt;a href="https://bank-parser.com" rel="noopener noreferrer"&gt;https://bank-parser.com/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=tutorial_pdf&lt;/a&gt; (free trial, no card required).&lt;/p&gt;

&lt;p&gt;Have you built PDF parsers before?&lt;br&gt;
What challenges did you face? I’d love to hear what approaches worked (or failed!) for you — share in the comments!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>typescript</category>
      <category>backend</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
