<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cal Mercer</title>
    <description>The latest articles on DEV Community by Cal Mercer (@cmsoxoa).</description>
    <link>https://dev.to/cmsoxoa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3769640%2F4fe8331d-556c-4d4f-b14e-dd1a0d21509d.png</url>
      <title>DEV Community: Cal Mercer</title>
      <link>https://dev.to/cmsoxoa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cmsoxoa"/>
    <language>en</language>
    <item>
      <title>Tax Document Parsing in 2026: 1099s, W-2s, and 1040s at Scale</title>
      <dc:creator>Cal Mercer</dc:creator>
      <pubDate>Thu, 12 Feb 2026 21:05:43 +0000</pubDate>
      <link>https://dev.to/cmsoxoa/tax-document-parsing-in-2026-1099s-w-2s-and-1040s-at-scale-26g1</link>
      <guid>https://dev.to/cmsoxoa/tax-document-parsing-in-2026-1099s-w-2s-and-1040s-at-scale-26g1</guid>
      <description>&lt;p&gt;Tax season hits different when you're processing thousands of documents for mortgage underwriting, income verification, or financial analysis. Here's what I learned building parsers for the big three tax documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Tax Documents
&lt;/h2&gt;

&lt;p&gt;Every tax document looks simple until you try to parse it at scale:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;W-2s&lt;/strong&gt;: Employers use different software (ADP, Gusto, Paychex, QuickBooks), each with slightly different layouts. Box positions drift. Multi-state filers get multiple copies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1099s&lt;/strong&gt;: There are 20+ variants (1099-INT, 1099-DIV, 1099-NEC, 1099-MISC, 1099-K...), each with different fields. Brokerages love adding supplemental pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1040s&lt;/strong&gt;: The IRS form itself is standardized, but schedules vary wildly. A simple return might be 2 pages. A complex one with K-1s and foreign accounts? 50+ pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;After processing millions of tax documents, here's the stack that scales:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Vision Models Beat Traditional OCR
&lt;/h3&gt;

&lt;p&gt;Forget Tesseract for tax docs. Vision models (GPT-4o, Claude) understand context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Traditional OCR sees: "12,345.67"
# Where is it? Box 1? Box 3? Who knows.
&lt;/span&gt;
&lt;span class="c1"&gt;# Vision model sees: "Box 1 Wages: $12,345.67"
# Context preserved.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The accuracy difference is night and day, especially for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handwritten corrections&lt;/li&gt;
&lt;li&gt;Low-quality scans&lt;/li&gt;
&lt;li&gt;Multi-column layouts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Schema-Driven Extraction
&lt;/h3&gt;

&lt;p&gt;Don't ask the model to "extract everything." Define exactly what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;w2Schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;employer_ein&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;box&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;b&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ein&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;employee_ssn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;box&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ssn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;redact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;wages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;box&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;currency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;federal_tax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;box&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;currency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;social_security_wages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;box&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;currency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ... 20+ more fields&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches extraction errors early ("Box 1 can't be negative") and normalizes data across formats.&lt;/p&gt;
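&lt;p&gt;A minimal Python sketch of what that validation step could look like. The &lt;code&gt;W2_SCHEMA&lt;/code&gt; dict, &lt;code&gt;parse_currency&lt;/code&gt;, and &lt;code&gt;validate_w2&lt;/code&gt; are hypothetical names that mirror a few of the fields above; they are not part of any real API.&lt;/p&gt;

```python
from decimal import Decimal, InvalidOperation

# Hypothetical Python mirror of a few w2Schema entries; field names follow the article.
W2_SCHEMA = {
    "wages": {"box": "1", "type": "currency"},
    "federal_tax": {"box": "2", "type": "currency"},
    "social_security_wages": {"box": "3", "type": "currency"},
}


def parse_currency(raw):
    """Normalize strings such as '$12,345.67' to Decimal; return None if unparseable."""
    cleaned = str(raw).replace("$", "").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject NaN and Infinity, which Decimal would otherwise accept.
    return value if value.is_finite() else None


def validate_w2(extracted):
    """Return (normalized values, list of error strings) for the schema fields."""
    normalized, errors = {}, []
    for name, spec in W2_SCHEMA.items():
        if name not in extracted:
            errors.append(f"Box {spec['box']} ({name}) is missing")
            continue
        value = parse_currency(extracted[name])
        if value is None:
            errors.append(f"Box {spec['box']} ({name}) is not a currency amount")
        elif value.is_signed():
            errors.append(f"Box {spec['box']} ({name}) can't be negative")
        else:
            normalized[name] = value
    return normalized, errors
```

&lt;p&gt;Returning a list of errors instead of raising on the first one lets a reviewer see every bad box in a single pass.&lt;/p&gt;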

&lt;h3&gt;
  
  
  3. Multi-Document Correlation
&lt;/h3&gt;

&lt;p&gt;The real power comes from cross-referencing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;W-2 wages should roughly match 1040 Line 1a&lt;/li&gt;
&lt;li&gt;1099-NEC totals should appear on Schedule C or Schedule SE&lt;/li&gt;
&lt;li&gt;Multiple W-2s from the same employer (state copies) should have consistent data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When they don't match, it's an extraction error, a filing error, or fraud. All three are worth flagging.&lt;/p&gt;
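&lt;p&gt;As a rough sketch, two of these cross-checks might look like the following. The &lt;code&gt;cross_check&lt;/code&gt; function, the input shape, and the one-dollar tolerance are all assumptions made for illustration. It also assumes every W-2 from the same EIN is a state copy, so each employer is counted once.&lt;/p&gt;

```python
import math


def cross_check(w2_forms, form_1040_line_1a, tolerance=1.0):
    """Return a list of human-readable flags; an empty list means no mismatch found.

    w2_forms: list of dicts with 'employer_ein' and 'wages' (floats, illustrative shape).
    tolerance: allowed dollar difference, an assumed value that covers rounding.
    """
    flags = []

    # State copies from the same employer should report identical Box 1 wages.
    wages_by_employer = {}
    for form in w2_forms:
        wages_by_employer.setdefault(form["employer_ein"], set()).add(form["wages"])
    for ein, wage_values in wages_by_employer.items():
        if len(wage_values) != 1:
            flags.append(f"Employer {ein}: W-2 copies disagree on wages")

    # Count each employer once, then compare the total with 1040 Line 1a.
    total_wages = sum(max(values) for values in wages_by_employer.values())
    if not math.isclose(total_wages, form_1040_line_1a, abs_tol=tolerance):
        flags.append(
            f"W-2 wages {total_wages:.2f} do not match 1040 Line 1a {form_1040_line_1a:.2f}"
        )
    return flags
```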

&lt;h2&gt;
  
  
  The Fraud Angle
&lt;/h2&gt;

&lt;p&gt;Tax documents are prime targets for forgery. Common tells:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Font inconsistencies&lt;/strong&gt; - Real W-2s use specific fonts per software vendor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Box alignment&lt;/strong&gt; - Pixel-perfect alignment is suspicious (real forms have slight drift)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata mismatches&lt;/strong&gt; - A 2023 W-2 is normally generated in January 2024, so a PDF creation date before the tax year ended, or long after the filing deadline, is a red flag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Round numbers&lt;/strong&gt; - Real wages are rarely exactly $50,000.00&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We built these checks into our &lt;a href="https://1099parser.com" rel="noopener noreferrer"&gt;1099 parser&lt;/a&gt;, &lt;a href="https://parsew2.com" rel="noopener noreferrer"&gt;W-2 parser&lt;/a&gt;, and &lt;a href="https://1040parser.com" rel="noopener noreferrer"&gt;1040 parser&lt;/a&gt;.&lt;/p&gt;
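&lt;p&gt;The round-number tell is the simplest to sketch. The function name and the 1,000-dollar step below are assumptions, and a hit is only a signal for review, since some salaries really are round.&lt;/p&gt;

```python
from decimal import Decimal


def is_suspiciously_round(amount, step=Decimal("1000")):
    """Flag nonzero amounts that are an exact multiple of step, e.g. 50000.00."""
    value = Decimal(str(amount))
    return value != 0 and value % step == 0
```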

&lt;h2&gt;
  
  
  Code Example: W-2 Extraction
&lt;/h2&gt;

&lt;p&gt;Here's the basic flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
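&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;# Assumption: the API key comes from an environment variable (the name is illustrative)
&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PARSEW2_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
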
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_w2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://parsew2.com/api/extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

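    &lt;span class="c1"&gt;# Fail fast on HTTP errors before decoding the body
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;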

    &lt;span class="c1"&gt;# Validate the extraction
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid wages amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;federal_tax&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tax withheld exceeds wages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance at Scale
&lt;/h2&gt;

&lt;p&gt;Numbers from production:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Document Type&lt;/th&gt;
&lt;th&gt;Avg Processing Time&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;W-2&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;td&gt;99.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1099-NEC&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1040 (simple)&lt;/td&gt;
&lt;td&gt;3.2s&lt;/td&gt;
&lt;td&gt;98.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1040 (complex)&lt;/td&gt;
&lt;td&gt;8.5s&lt;/td&gt;
&lt;td&gt;97.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 1040 accuracy drops with complexity because Schedule K-1s are genuinely chaotic.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Build vs Buy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build your own if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need custom validation rules&lt;/li&gt;
&lt;li&gt;You're processing 100k+ documents/month&lt;/li&gt;
&lt;li&gt;You have specific compliance requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use an API if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to ship fast&lt;/li&gt;
&lt;li&gt;Volume is under 10k docs/month&lt;/li&gt;
&lt;li&gt;You want fraud detection included&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The build-vs-buy math changes around 50k docs/month, where API costs start to exceed the cost of a dedicated ML engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tax Season is Coming
&lt;/h2&gt;

&lt;p&gt;If you're in fintech, mortgage, or lending, you know what January-April looks like. The volume spike is brutal. Whatever solution you choose, load test it now.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We built specialized parsers for tax documents at &lt;a href="https://parsew2.com" rel="noopener noreferrer"&gt;parsew2.com&lt;/a&gt;, &lt;a href="https://1099parser.com" rel="noopener noreferrer"&gt;1099parser.com&lt;/a&gt;, and &lt;a href="https://1040parser.com" rel="noopener noreferrer"&gt;1040parser.com&lt;/a&gt;. They handle the edge cases so you don't have to.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>fintech</category>
      <category>ai</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>The Hidden Complexity of Bank Statement Parsing (And How We Handle 500+ Formats)</title>
      <dc:creator>Cal Mercer</dc:creator>
      <pubDate>Thu, 12 Feb 2026 21:00:15 +0000</pubDate>
      <link>https://dev.to/cmsoxoa/the-hidden-complexity-of-bank-statement-parsing-and-how-we-handle-500-formats-24je</link>
      <guid>https://dev.to/cmsoxoa/the-hidden-complexity-of-bank-statement-parsing-and-how-we-handle-500-formats-24je</guid>
      <description>&lt;p&gt;Everyone thinks parsing a bank statement should be simple. It's just a list of transactions, right?&lt;/p&gt;

&lt;p&gt;Wrong.&lt;/p&gt;

&lt;p&gt;After building parsers for dozens of document types, bank statements remain one of the most deceptively complex. Here's what we learned handling 500+ different formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Format Explosion
&lt;/h2&gt;

&lt;p&gt;There are roughly 4,500 FDIC-insured banks in the US alone. Add credit unions, international banks, and neobanks, and you're looking at tens of thousands of institutions. Each one formats their statements differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chase&lt;/strong&gt; uses a clean columnar layout.&lt;br&gt;
&lt;strong&gt;Bank of America&lt;/strong&gt; loves multi-page summaries before showing transactions.&lt;br&gt;
&lt;strong&gt;Wells Fargo&lt;/strong&gt; splits deposits and withdrawals into separate sections.&lt;br&gt;
&lt;strong&gt;Capital One&lt;/strong&gt; sometimes puts the date first, sometimes the description.&lt;/p&gt;

&lt;p&gt;And that's just the big guys. Regional banks and credit unions often have PDF layouts that look like they were designed in 1998 using Microsoft Publisher.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Template Matching Fails
&lt;/h2&gt;

&lt;p&gt;Our first approach was template matching. For each bank, we'd define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the date column lives&lt;/li&gt;
&lt;li&gt;The format of amounts (with or without dollar signs, parentheses for negatives)&lt;/li&gt;
&lt;li&gt;How to identify the transaction type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This worked for about 6 months. Then we hit three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Banks update their statements&lt;/strong&gt; - Chase redesigned their PDF layout twice in one year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The long tail is brutal&lt;/strong&gt; - We'd get a statement from "First National Bank of Rural County" and have to build a new template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same bank, different products&lt;/strong&gt; - Checking, savings, and business account statements each use a different layout&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We were building 5-10 new templates per week. It wasn't sustainable.&lt;/p&gt;
&lt;h2&gt;
  
  
  The OCR Problem
&lt;/h2&gt;

&lt;p&gt;Raw OCR gives you text, but bank statements are fundamentally about &lt;em&gt;tables&lt;/em&gt;. The spatial relationship between columns matters.&lt;/p&gt;

&lt;p&gt;Consider this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;02/15  AMAZON MARKETPLACE     -$47.99  $1,234.56
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OCR sees: &lt;code&gt;02/15 AMAZON MARKETPLACE -$47.99 $1,234.56&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But which number is the transaction amount and which is the running balance? In some formats, the balance comes first. In others, it's not shown at all.&lt;/p&gt;
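&lt;p&gt;When the previous balance is known, arithmetic can sometimes settle the question even without layout information. The sketch below is an illustrative heuristic, not the approach described in the rest of this post; &lt;code&gt;split_amount_and_balance&lt;/code&gt; is a made-up name, and it assumes debits are signed negative as in the line above.&lt;/p&gt;

```python
from decimal import Decimal


def split_amount_and_balance(previous_balance, first, second):
    """Decide which of two numbers on a row is the amount and which is the balance.

    Returns (amount, balance), or None when neither ordering (or both) is consistent
    with previous_balance + amount == balance.
    """
    amount_first = previous_balance + first == second
    balance_first = previous_balance + second == first
    if amount_first == balance_first:
        return None  # ambiguous or inconsistent: leave it for review
    return (first, second) if amount_first else (second, first)
```

&lt;p&gt;It breaks down as soon as a format omits the running balance, which is one reason layout understanding still matters.&lt;/p&gt;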

&lt;h2&gt;
  
  
  The Breakthrough: Vision Models + Table Understanding
&lt;/h2&gt;

&lt;p&gt;Modern vision LLMs don't just read text. They understand layout. They can look at a bank statement and recognize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is a table structure&lt;/li&gt;
&lt;li&gt;These are column headers (even if implicit)&lt;/li&gt;
&lt;li&gt;This row is a transaction&lt;/li&gt;
&lt;li&gt;This is a summary/total row (skip it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF → Image → Vision LLM → Table Extraction → Schema Validation → JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema is critical. We define exactly what we expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"account"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"holder_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"account_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"routing_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"account_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"checking|savings|business"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"period"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"start_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"end_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transactions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"credit|debit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"running_balance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number|null"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"opening_balance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"closing_balance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_credits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_debits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Edge Cases That Will Break You
&lt;/h2&gt;

&lt;p&gt;Even with vision models, bank statements have edge cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-page transactions&lt;/strong&gt; - A single transaction description can wrap across pages&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pending vs. posted&lt;/strong&gt; - Some statements show both, with different formatting&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Foreign currency&lt;/strong&gt; - Amount in USD vs. original currency, exchange rates&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interest calculations&lt;/strong&gt; - Daily balance tables that aren't transactions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fees buried in descriptions&lt;/strong&gt; - "Monthly Service Fee" as a line item vs. as a deduction footnote&lt;/p&gt;

&lt;p&gt;We handle these with a combination of prompt engineering and post-processing validation. If the extracted transactions don't reconcile to the stated totals, we retry with more specific instructions.&lt;/p&gt;
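&lt;p&gt;The reconciliation check itself is small. Here is a sketch under stated assumptions: amounts are signed (debits negative), the parsed dict has the &lt;code&gt;transactions&lt;/code&gt; and &lt;code&gt;summary&lt;/code&gt; shape from the schema above, and &lt;code&gt;extract&lt;/code&gt; is a hypothetical callable standing in for the model call, with the attempt number available to select more specific instructions.&lt;/p&gt;

```python
from decimal import Decimal


def reconciles(parsed):
    """True when opening balance plus all signed amounts equals the closing balance."""
    summary = parsed["summary"]
    # Convert via str so floats decoded from JSON (e.g. 4686.57) become exact decimals.
    opening = Decimal(str(summary["opening_balance"]))
    closing = Decimal(str(summary["closing_balance"]))
    movement = sum(Decimal(str(t["amount"])) for t in parsed["transactions"])
    return opening + movement == closing


def parse_with_retry(extract, max_attempts=3):
    """Call extract(attempt) until the result reconciles; return None if it never does."""
    for attempt in range(max_attempts):
        parsed = extract(attempt)
        if reconciles(parsed):
            return parsed
    return None
```

&lt;p&gt;Using &lt;code&gt;Decimal&lt;/code&gt; instead of floats keeps the equality check exact to the cent.&lt;/p&gt;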

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After 8 months of iteration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;96% accuracy&lt;/strong&gt; on transaction extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;500+ bank formats&lt;/strong&gt; supported without manual templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New formats work automatically&lt;/strong&gt; (the vision model generalizes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing time&lt;/strong&gt;: 2-5 seconds per page&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The API
&lt;/h2&gt;

&lt;p&gt;We wrapped this into an API. Upload a bank statement PDF, get structured JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://statementocr.com/api/parse &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_API_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@statement.pdf"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"account"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"holder_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John Smith"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"account_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"****4567"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transactions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-02-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DIRECT DEPOSIT - ACME CORP"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3500.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"credit"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-02-03"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AMAZON MARKETPLACE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-47.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debit"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"opening_balance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1234.56&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"closing_balance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4686.57&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Who's Using This?
&lt;/h2&gt;

&lt;p&gt;Three main use cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lending platforms&lt;/strong&gt; - Income verification without Plaid/bank linking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accounting software&lt;/strong&gt; - Auto-import statements for reconciliation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud detection&lt;/strong&gt; - Analyze spending patterns at scale&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The lending use case is huge. Not everyone wants to connect their bank account via OAuth. Some customers prefer uploading a PDF. And for businesses, bank statements are often the only option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you're building anything that needs to understand bank statements, &lt;a href="https://statementocr.com" rel="noopener noreferrer"&gt;Statement OCR&lt;/a&gt; has a free tier. Upload a few statements and see the output.&lt;/p&gt;

&lt;p&gt;Works with most US banks out of the box. International support is improving.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 2 of a series on document parsing. Previously: &lt;a href="https://dev.to/cmsoxoa/building-an-eob-parser-why-healthcare-documents-are-the-hardest-to-parse-122b"&gt;EOB parsing&lt;/a&gt;. Next: tax documents.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>fintech</category>
      <category>api</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>Building an EOB Parser: Why Healthcare Documents Are the Hardest to Parse</title>
      <dc:creator>Cal Mercer</dc:creator>
      <pubDate>Thu, 12 Feb 2026 20:54:17 +0000</pubDate>
      <link>https://dev.to/cmsoxoa/building-an-eob-parser-why-healthcare-documents-are-the-hardest-to-parse-122b</link>
      <guid>https://dev.to/cmsoxoa/building-an-eob-parser-why-healthcare-documents-are-the-hardest-to-parse-122b</guid>
      <description>&lt;p&gt;I've built document parsers for tax forms, bank statements, and invoices. None of them prepared me for Explanation of Benefits documents.&lt;/p&gt;

&lt;p&gt;EOBs are the documents your health insurance sends after a medical visit. They explain what was billed, what insurance paid, and what you owe. Simple concept. Absolute nightmare to parse.&lt;/p&gt;

&lt;p&gt;Here's why - and how we eventually cracked it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with EOBs
&lt;/h2&gt;

&lt;p&gt;Every insurance company formats EOBs differently. Not just "slightly different layouts" - completely different information hierarchies, terminology, and structures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blue Cross&lt;/strong&gt; puts the patient responsibility at the top.&lt;br&gt;
&lt;strong&gt;Aetna&lt;/strong&gt; buries it in a table on page 2.&lt;br&gt;
&lt;strong&gt;UnitedHealthcare&lt;/strong&gt; uses cryptic codes that require a separate decoder ring.&lt;br&gt;
&lt;strong&gt;Kaiser&lt;/strong&gt; somehow makes it even more confusing.&lt;/p&gt;

&lt;p&gt;And that's just the major payers. There are 900+ health insurance companies in the US, each with their own EOB format.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Traditional OCR Fails
&lt;/h2&gt;

&lt;p&gt;We tried Tesseract. It read the text fine but had no concept of what the text meant. Lines like "Amount Billed: $450.00" and "Your Responsibility: $450.00" look identical to a regex, but the first is informational and the second is what you actually owe.&lt;/p&gt;
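
&lt;p&gt;To make that concrete, here's a tiny sketch of the two example lines run through a generic dollar-amount pattern. The pattern is an assumption for illustration, not the one we actually used, but it shows the issue: both lines produce the same match.&lt;/p&gt;

```python
import re

# Generic dollar-amount pattern, the kind a text-only pipeline tends to lean on.
# This exact pattern is illustrative, not taken from a real parser.
AMOUNT = re.compile(r"\$(\d+\.\d{2})")

lines = [
    "Amount Billed: $450.00",
    "Your Responsibility: $450.00",
]

# Both lines yield the same captured amount; nothing in the pattern
# says which one the patient actually owes.
amounts = [AMOUNT.search(line).group(1) for line in lines]
print(amounts)
```

&lt;p&gt;You can bolt label matching on top, but then you're maintaining a list of every label every payer uses, which is template matching by another name.&lt;/p&gt;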

&lt;p&gt;We tried template matching. It worked for about 3 weeks until Blue Cross updated their EOB layout and broke everything.&lt;/p&gt;

&lt;p&gt;We tried training custom models. The dataset problem is brutal - EOBs contain PHI (Protected Health Information), so you can't just scrape thousands of examples from the internet.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Breakthrough: Vision LLMs + Structured Output
&lt;/h2&gt;

&lt;p&gt;The solution came from treating this as a visual understanding problem, not a text extraction problem.&lt;/p&gt;

&lt;p&gt;Modern vision models (Claude, GPT-4V) can look at an EOB and actually &lt;em&gt;understand&lt;/em&gt; it the way a human does. They see the layout, recognize the patterns, and extract meaning.&lt;/p&gt;

&lt;p&gt;But raw LLM output is unreliable. You need structured output with validation.&lt;/p&gt;

&lt;p&gt;Here's the architecture that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EOB Image → Vision LLM → JSON Schema Validation → Structured Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is the schema. We define exactly which fields we expect, and the type of each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"patient"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"member_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"group_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"claim"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"claim_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"date_of_service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"services"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cpt_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"billed_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allowed_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"insurance_paid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"patient_responsibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"adjustment_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_billed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_allowed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_insurance_paid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_patient_responsibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The vision model extracts. The schema validates. Invalid responses get retried with more specific prompting.&lt;/p&gt;
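
&lt;p&gt;Here's a rough sketch of that extract, validate, retry loop. Everything in it is an assumption for illustration: &lt;code&gt;call_model&lt;/code&gt; is a stand-in for whatever vision LLM client you use, the required fields are a small subset of the schema above, and the validator is hand-rolled to keep the example dependency-free (a library such as &lt;code&gt;jsonschema&lt;/code&gt; could do this job instead).&lt;/p&gt;

```python
import json

# Hypothetical subset of the schema above: field name mapped to accepted types.
REQUIRED_SERVICE_FIELDS = {
    "cpt_code": str,
    "billed_amount": (int, float),
    "patient_responsibility": (int, float),
}


def validation_errors(payload):
    """Return a list of readable problems; an empty list means the payload passed."""
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]
    services = payload.get("services")
    if not isinstance(services, list) or not services:
        return ["services must be a non-empty list"]
    errors = []
    for i, service in enumerate(services):
        if not isinstance(service, dict):
            errors.append(f"services[{i}] must be an object")
            continue
        for field, expected in REQUIRED_SERVICE_FIELDS.items():
            value = service.get(field)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(f"services[{i}].{field} is missing or has the wrong type")
    return errors


def extract_with_retry(call_model, image_bytes, max_attempts=3):
    """call_model(image_bytes, hint) is a placeholder that returns the raw model text."""
    hint = ""
    for _ in range(max_attempts):
        raw = call_model(image_bytes, hint)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            hint = "Return only valid JSON."
            continue
        errors = validation_errors(payload)
        if not errors:
            return payload
        # Feed the specific failures back so the next prompt is more targeted.
        hint = "Fix these problems: " + "; ".join(errors)
    raise ValueError("no valid extraction after retries")
```

&lt;p&gt;The design choice worth copying is the hint: each retry tells the model which fields failed instead of just asking again.&lt;/p&gt;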

&lt;h2&gt;
  
  
  Real-World Results
&lt;/h2&gt;

&lt;p&gt;After 6 months of iteration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;94% accuracy&lt;/strong&gt; on patient responsibility amounts (the number that matters most)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;89% accuracy&lt;/strong&gt; on individual service line items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works across 50+ payer formats&lt;/strong&gt; without template updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The remaining errors are mostly edge cases: handwritten adjustments, multi-page EOBs where totals don't match line items, and the occasional payer that seems to actively obfuscate information.&lt;/p&gt;
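
&lt;p&gt;The totals-versus-line-items case is at least easy to detect, even if it isn't easy to fix. A minimal sketch, assuming the parsed output follows the schema above (the function name and return shape are made up for this example):&lt;/p&gt;

```python
from decimal import Decimal

# Pairs of (line-item field, summary field), using the field names from the schema.
TOTAL_FIELDS = [
    ("billed_amount", "total_billed"),
    ("allowed_amount", "total_allowed"),
    ("insurance_paid", "total_insurance_paid"),
    ("patient_responsibility", "total_patient_responsibility"),
]


def total_mismatches(eob):
    """Return {summary_field: (stated, computed)} for every total that disagrees."""
    mismatches = {}
    summary = eob.get("summary", {})
    services = eob.get("services", [])
    for item_field, summary_field in TOTAL_FIELDS:
        if summary_field not in summary:
            continue  # this total was not extracted, so there is nothing to compare
        # Convert via str() so 17.9 becomes Decimal("17.9") rather than
        # the long binary expansion Decimal(17.9) would give.
        computed = sum(
            (Decimal(str(s.get(item_field, 0))) for s in services),
            Decimal("0"),
        )
        stated = Decimal(str(summary[summary_field]))
        if stated != computed:
            mismatches[summary_field] = (stated, computed)
    return mismatches
```

&lt;p&gt;A non-empty result doesn't tell you which side is wrong. It just flags the document for a retry or a human look.&lt;/p&gt;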

&lt;h2&gt;
  
  
  The API
&lt;/h2&gt;

&lt;p&gt;We wrapped this into an API. Upload an EOB image, get structured JSON back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://eobextractor.com/api/parse &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_API_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@eob.pdf"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"patient"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jane Smith"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"member_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"XYZ123456789"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"services"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cpt_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"99213"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Office visit, established patient"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"billed_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;150.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allowed_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;89.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"insurance_paid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;71.60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"patient_responsibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;17.90&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_patient_responsibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;17.90&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
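
&lt;p&gt;For a sense of how the numbers in that sample relate, here's the arithmetic spelled out. The "write-off" label is my own shorthand for billed minus allowed, and the exact split between insurer and patient holds for this sample only. Don't assume it for every EOB.&lt;/p&gt;

```python
from decimal import Decimal

# Amounts copied from the sample response above, built from strings so the
# decimal values are exact.
service = {
    "billed_amount": Decimal("150.00"),
    "allowed_amount": Decimal("89.50"),
    "insurance_paid": Decimal("71.60"),
    "patient_responsibility": Decimal("17.90"),
}

# Billed minus what the plan allows (called a write-off here for illustration).
write_off = service["billed_amount"] - service["allowed_amount"]

# In this sample the allowed amount splits exactly between insurer and patient.
covered = service["insurance_paid"] + service["patient_responsibility"]

print(write_off)                             # 60.50
print(covered == service["allowed_amount"])  # True
```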



&lt;h2&gt;
  
  
  Who's Using This?
&lt;/h2&gt;

&lt;p&gt;Three main use cases emerged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Patient advocacy apps&lt;/strong&gt; - Help people understand what they actually owe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare billing teams&lt;/strong&gt; - Reconcile EOBs against claims at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HSA/FSA platforms&lt;/strong&gt; - Auto-categorize healthcare expenses&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The healthcare billing space is massive and still surprisingly manual. We're seeing customers process thousands of EOBs per month that were previously hand-keyed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you're building anything that touches healthcare billing, &lt;a href="https://eobextractor.com" rel="noopener noreferrer"&gt;EOB Extractor&lt;/a&gt; has a free tier. Upload a few EOBs and see the output.&lt;/p&gt;

&lt;p&gt;The parser handles most payer formats out of the box. If you find one it struggles with, we'll add support - we're still improving coverage.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on document parsing. Next up: why bank statements are deceptively complex.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>healthcare</category>
      <category>api</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
