DEV Community: Martin

How to Convert Scanned Bank Statements to Excel (When Your Client Sends a Photo)

Martin — Tue, 02 Jun 2026 17:30:24 +0000

Every accountant and bookkeeper has been there: a client sends their bank statement as a photo taken with their phone, or a PDF that's actually just a scanned image with no text layer. You open it in Excel, try the built-in data import, and get absolutely nothing — or worse, a garbled mess.

This guide covers the hard case specifically: image-based bank statements. Native PDFs (where you can highlight and copy text) are straightforward. It's the scanned and photographed documents that eat time and introduce errors.

Why Standard Methods Fail on Scanned Documents

When a PDF contains actual text (a "native" or "text-layer" PDF), most conversion tools work fine. Excel's Get & Transform feature, Adobe Acrobat's export, and free tools like Smallpdf or iLovePDF all rely on extracting that embedded text.

Scanned documents have no embedded text. They're photographs. When you try to "convert" them with a basic tool, the software either returns nothing or produces garbled output — it's treating a pixel image as if it were formatted data.

What also doesn't work

Copy-paste from Adobe Acrobat. If there's no text layer, you copy blank space. Adobe will sometimes generate a synthetic text layer via OCR, but the results are unreliable for tables — amounts end up in description columns, rows merge.

Python libraries (Tabula, Camelot, pdfplumber). All three extract text from PDFs. None of them performs OCR. Calling camelot.read_pdf("bank_statement.pdf") on a scanned document finds no tables and returns an empty result. Developers frequently discover this after building a pipeline that silently produces zero rows on real-world input.

Excel Power Query → From PDF. Microsoft's built-in importer also requires a text layer. It errors or returns empty on scanned files.

Google Drive "Open with Docs." Reasonable OCR for single-column text, poor for tables. Transaction rows get merged, amounts migrate to description columns, multi-line descriptions fragment incorrectly.
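Because text-only extractors return nothing on a scan instead of failing, a pipeline can guard against that case explicitly. A minimal sketch in plain Python: `require_rows` and `EmptyExtractionError` are hypothetical names, and `rows` stands for whatever list of transaction rows your extraction step produced.

```python
class EmptyExtractionError(RuntimeError):
    """Raised when a PDF yields no transaction rows (hypothetical helper)."""


def require_rows(rows, source):
    """Return rows unchanged, or raise if the extraction came back empty.

    An empty result often means the file is image-only and needs OCR first.
    """
    if not rows:
        raise EmptyExtractionError(
            f"{source}: no rows extracted; the file may be a scan with no text layer"
        )
    return rows
```

Wrapping each extraction call this way turns "zero rows written" into an error you see on the first scanned file rather than at month end.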

What OCR Actually Does (And Why Table Structure Is the Hard Part)

OCR (Optical Character Recognition) converts pixel patterns into characters. The challenge for bank statements isn't reading individual characters — modern OCR handles that reliably. The challenge is understanding table structure.

A bank statement typically has columns for Date, Description, Debit, Credit, and Balance. Amounts right-align. Descriptions sometimes wrap across two lines. The column boundary between Description and Debit isn't marked by any visible separator; it has to be inferred from spatial positioning.

Naive OCR reads left-to-right, top-to-bottom, producing a stream of text with no knowledge that "120.50" belongs in the Credit column of a specific row. AI-powered tools add a second layer: table detection and reconstruction. They identify column boundaries, infer row groupings, and output a structured spreadsheet rather than a text dump.
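To make the "second layer" concrete, here is a simplified sketch of turning positioned OCR words into table rows. It is an illustration, not how any particular product works: the `(text, x, y)` word format is an assumed OCR output shape, and the column start positions are passed in by hand, whereas real tools have to infer them (and usually align amounts by their right edge).

```python
def words_to_rows(words, column_edges, row_tolerance=5):
    """Group OCR words into table rows and columns by position.

    words: list of (text, x, y) tuples; x is the word's left edge and y its
           vertical centre, in pixels (assumed OCR output shape).
    column_edges: ascending x positions where each column starts.
    row_tolerance: words within this many pixels vertically share a row.
    """
    rows = []  # each entry: (anchor_y, [(x, column_index, text), ...])
    for text, x, y in sorted(words, key=lambda w: w[2]):
        # Pick the last column whose starting edge is at or left of the word.
        col = max((i for i, edge in enumerate(column_edges) if x >= edge), default=0)
        if rows and abs(y - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((x, col, text))
        else:
            rows.append((y, [(x, col, text)]))

    table = []
    for _, placed in rows:
        cells = [[] for _ in column_edges]
        for x, col, text in sorted(placed):  # left-to-right within the row
            cells[col].append(text)
        table.append([" ".join(cell) for cell in cells])
    return table
```

With column starts for Date, Description, Debit, and Credit, a word at x=520 lands in the Credit cell of its row even though nothing on the page marks that boundary.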

Step-by-Step: Converting a Photographed Bank Statement with PDFExcel

PDFExcel is built for messy documents — including photos and scans. Here's the workflow:

1. Upload your file. Drag it to the upload area. PDFExcel accepts PDFs (including image-only PDFs with no text layer) and also direct image files (JPG, PNG). If your client sent a phone photo, upload it directly — no PDF conversion needed first.

2. Let it run. Processing takes 20–60 seconds for a typical bank statement. The AI identifies table structure, reads amounts including decimals and thousands separators, handles bold header rows differently from data rows, and preserves column alignment.

3. Review the preview. Before downloading, you get a preview of the extracted data. Spot-check: do the column headers look right? Are the first few transactions correct? Verify an amount you can confirm — usually the closing balance.

4. Download as Excel or CSV. The .xlsx output has proper column headers, numeric values (not text-formatted numbers), and dates in a sortable format. Load it into your bookkeeping software or reconciliation workflow.

Getting Better Results from Low-Quality Scans

Not all scanned documents are equal. A few things make a real difference:

Photo quality matters more than you'd expect. The gap between a clear photo (good lighting, no shadows, camera directly above the page) and a skewed phone photo can mean the difference between 99% accuracy and 80%. If a client's statements consistently arrive as poor-quality photos, it's worth asking them to use a document scanner before sending: the scanner built into the iOS Notes app and the scan feature in the Google Drive app both auto-straighten and sharpen the image.

Straighten before uploading. If a document is noticeably rotated or perspective-distorted, correct it first. Most phones offer this in a built-in document scan mode. This single step eliminates most structural extraction errors.

Multi-page statements. Upload the whole document — don't split a 12-month statement into 12 separate files. PDFExcel handles multi-page PDFs as one document and produces a single consolidated spreadsheet.

Always validate against the closing balance. Sum the extracted transactions and reconcile. A discrepancy immediately flags rows that were missed or misread. This step catches the 1–2% of cases where a low-res scan produces an error.
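The closing-balance check is simple arithmetic: opening balance plus credits minus debits should equal the closing balance. A minimal sketch, assuming each extracted row has been reduced to a `(debit, credit)` pair; the function name `reconcile` is hypothetical.

```python
from decimal import Decimal


def reconcile(opening, closing, transactions):
    """Return the gap between the stated and the computed closing balance.

    transactions: iterable of (debit, credit) amounts as strings or Decimals;
    use "0" for the side that is empty. A zero result means the extracted
    rows account for the whole balance movement.
    """
    computed = Decimal(opening)
    for debit, credit in transactions:
        computed += Decimal(credit) - Decimal(debit)
    return Decimal(closing) - computed
```

A non-zero result equal to one transaction's amount usually points to a single dropped row, which narrows the search considerably.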

Scanned vs. Native PDFs: How to Tell Them Apart

If you're not sure whether a file is image-based or has a text layer:

Open it in any PDF viewer and try to highlight and copy text. If individual characters are selectable, it's native.
In Chrome: press Ctrl+F and search for a word you can see. Zero results = no text layer.
In Adobe Acrobat: Tools → Edit PDF. If nothing is selectable, it's image-only.
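When files arrive in bulk, the same check can be scripted. The sketch below assumes you already have one text string per page; `has_text_layer` and its `min_chars` threshold are hypothetical, while the optional helper uses pdfplumber's `extract_text()` (a third-party library) to collect the page text.

```python
def has_text_layer(page_texts, min_chars=20):
    """Guess whether a PDF has a usable text layer.

    page_texts: one string (or None) per page, as returned by a text extractor.
    min_chars: hypothetical threshold; a few stray characters do not count.
    """
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


def page_texts_from_pdf(path):
    """Collect per-page text with pdfplumber (pip install pdfplumber)."""
    import pdfplumber  # imported lazily so has_text_layer works without it

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

Files that fail the check go to the OCR route; everything else can use the standard converters.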

Native PDFs are easier — most tools handle them. But if you're processing a mix (common when clients send a combination of downloaded statements and scanned archives), using one tool that handles both is simpler than maintaining two workflows.

When to Use What

Document type	Recommended approach
Native PDF (text highlightable)	Any standard converter, or Excel's built-in Get & Transform
Scanned PDF, clear scan	PDFExcel, Nanonets, DocuClipper
Phone photo of statement	PDFExcel (accepts image files directly)
Very low-res or damaged scan	Re-photograph or rescan first; then AI converter
Batch processing 100+ statements/month	PDFExcel API or a dedicated automation platform

The Bottom Line

Scanned and photographed bank statements are a genuine pain point in accounting workflows — and the tools most accountants reach for first (Acrobat, Excel's Get & Transform, free online converters) all fail on image-based PDFs. An AI tool with proper OCR and table structure detection handles them cleanly.

I use PDFExcel specifically for client documents that arrive as photos or old scans. The free plan covers 10 documents per month, which works for occasional use; the Standard plan at $69/month makes sense if you're processing statements regularly.

The workflow: upload, preview, download. Five minutes instead of an hour of manual entry, and far fewer transcription errors along the way, provided you still run the closing-balance check.

DocuClipper vs PDFExcel vs LedgerDocs for Accountants: Which PDF-to-Excel Tool Fits Your Workflow?

Martin — Mon, 01 Jun 2026 18:00:30 +0000

If you work in accounting or bookkeeping, you have probably hit the same wall: a client sends you a PDF — a bank statement, an invoice stack, a set of 1099s — and you need that data in Excel before you can do anything useful. The question is which tool to trust with it.

Three names come up repeatedly in accountant and bookkeeper forums in 2026: DocuClipper, PDFExcel, and LedgerDocs. They target the same use case but approach it differently. This is an honest comparison of where each one fits and where each one fails.

The short version

	DocuClipper	PDFExcel	LedgerDocs
Best for	Bank statements, 1099s, high-volume financial docs	Mixed document types, scanned/photographed PDFs	Accounting workflow integration, team use
Scanned PDFs	Yes (OCR)	Yes (AI OCR)	Yes (OCR)
Template-free	Yes	Yes	Yes
Free tier	No (trial only)	Yes — 10 docs/month	No (trial only)
Pricing	From ~$39/month	From $0; Standard $69/month	From ~$49/month
Output formats	Excel, CSV, QBO, OFX, Xero	Excel, CSV, Google Sheets	Excel, CSV
Reconciliation built-in	Yes	No	Limited

If you process bank statements at volume for bookkeeping clients, DocuClipper is probably the right answer. If your documents are a mixed bag — some bank statements, some invoices, some photographed receipts — PDFExcel's broader scope may be more practical. If your team uses cloud accounting software and wants documents to flow directly into that workflow, LedgerDocs is worth evaluating.

The longer answer follows.

DocuClipper: best-in-class for financial statements

DocuClipper has one clear strength: it was built specifically for financial documents that accountants handle at volume. Its bank statement processing is genuinely impressive — it reads transaction dates, amounts, descriptions, and running balances from any US or international bank without needing a template for that specific bank's format.

What it does well:

Bank statements from any bank, digital or scanned. DocuClipper runs a reconciliation check on every statement — summing the extracted transactions and comparing against the opening and closing balances to flag discrepancies before you see the output. This is a meaningful safety net.
1099 extraction. It handles 1099-NEC, 1099-MISC, 1099-INT, 1099-DIV, and composite brokerage 1099-Bs, outputting each form as a row with named columns and recipient TINs maskable for privacy.
Export directly to QuickBooks (QBO/QFX), Xero, and OFX — formats that most competitors don't support.

Where it's limited:

The pricing is structured for volume. There is no free tier — you get a trial, then choose a plan. For occasional use or small practices, this may not be cost-effective.
It is narrowly optimized for financial documents. If you need to extract data from a general-purpose contract, a technical report, or a non-standard document type, DocuClipper is not the right tool — its extraction engine is tuned for known financial form layouts, not arbitrary documents.
The UI is functional but not particularly fast for one-off jobs. It is designed for batch workflow, not quick single-document lookups.

Best for: Bookkeepers and CPA firms processing significant volume of bank statements, 1099s, and QuickBooks-integrated workflows.

PDFExcel: widest document-type coverage

PDFExcel takes a different approach. Rather than optimizing for one document category, it uses a general-purpose AI model trained across document types — bank statements, invoices, receipts, purchase orders, financial reports, and more. The core product pitch is: upload any PDF, including scanned or photographed documents, and get structured Excel or CSV output without configuring templates.

What it does well:

Scanned and photographed documents. This is the category where PDFExcel most clearly differentiates itself. A client who emails a photo of their bank statement taken on their phone — often with the page not quite flat, lighting uneven — is a use case that template-based tools handle poorly. PDFExcel's OCR handles this class of input without special configuration.
No-template operation across document types. If your client base is varied — some use Chase, some use local credit unions, some send Amex statements, some send foreign-bank PDFs — not needing to configure a new template for each one is a real time saver.
The free tier covers 10 documents per month. For accountants who want to test the tool on their actual file types before committing, or whose volume is modest, this is a meaningful entry point.
Speed for individual documents. Upload, wait 15–25 seconds, download. Fast for one-off jobs.

Where it's limited:

There is no built-in reconciliation check. Unlike DocuClipper, which verifies that your extracted transactions sum to the statement's opening and closing balances, PDFExcel gives you the extraction — verifying it is your job.
No direct integration with QuickBooks or Xero. Output is Excel, CSV, or Google Sheets. If your workflow requires QBO files, you will need an extra conversion step.
For very high volumes (hundreds of statements per month), DocuClipper's batch reconciliation workflow is more purpose-built.

Best for: Accounting practices that work with varied document types and clients who send scanned or photographed PDFs; practitioners who want a free tier for occasional use; any situation where the document isn't a standard bank statement or 1099.

LedgerDocs: accounting workflow integration

LedgerDocs positions itself as a document management and conversion hub for bookkeepers and accountants rather than a pure PDF converter. Its conversion quality is comparable to the other two for standard digital PDFs, and it handles scanned documents with OCR.

What it does well:

Team collaboration features. LedgerDocs is built for accounting firms where multiple staff members handle the same client's documents. Sharing, reviewing, and organizing extracted data within a team is smoother than in either DocuClipper or PDFExcel.
Cloud storage integration. Documents can be organized by client, period, or document type within LedgerDocs rather than just downloaded locally.
Solid handling of receipts and expense documents alongside bank statements — it positions itself as a full document hub, not just a converter.

Where it's limited:

Fewer output formats than DocuClipper. No QBO or Xero export; output is Excel or CSV.
The per-document accuracy on complex bank statements is good but not quite at DocuClipper's level for high-volume financial statement work.
Pricing positions it toward small-to-mid-size accounting firms, not individual bookkeepers.

Best for: Accounting firms that want document management alongside conversion, and value team workflow and client organization over raw extraction volume or format flexibility.

How to choose

You process mostly bank statements and 1099s at volume → DocuClipper. The reconciliation check alone is worth the premium for high-volume statement processing, and the QBO/Xero export eliminates a conversion step.

Your clients send scanned documents and photos, or your document types are mixed → PDFExcel. The free tier lets you test it against your actual files. The AI OCR handles low-quality scans that template-based tools fumble. If you are not sure which tool works for your specific document types, start here — the 10-document free tier is a real evaluation opportunity, not a token trial.

You run a multi-person accounting firm and want document workflow, not just conversion → LedgerDocs. The collaboration and client-organization features justify the difference over a pure conversion tool.

You are not sure? The practical test: take your five most representative client documents — the ones that currently take the most time to deal with — and run each tool against them. The quality differences that matter are in edge cases: unusual bank formats, photos taken in poor light, 40-page brokerage composites. Generic well-formatted PDFs all convert fine. The test is the hard ones.

Bottom line

DocuClipper, PDFExcel, and LedgerDocs each solve the same problem with different emphases. They are not interchangeable, and the right choice depends more on your document mix and workflow than on which one has the highest G2 rating.

For any accounting practice dealing with photographed or scanned documents from smaller or international banks, a tool with genuine AI OCR — like PDFExcel — will save more time than one optimized purely for known financial form layouts. For high-volume bank statement processing from standard US banks with reconciliation built in, DocuClipper earns its reputation. For team workflow, LedgerDocs fits a need the other two don't fully address.

The 10 documents/month free tier on PDFExcel is the lowest-friction way to find out if it handles your specific document types before spending anything.

Author note: I tested all three tools against a mix of bank statements (digital and photographed), a 1099-B brokerage composite, and a set of vendor invoices. PDFExcel handled the photographed documents and non-standard formats most consistently; DocuClipper was faster for the high-volume bank statement workflow where reconciliation mattered; LedgerDocs showed its value in the client-organization layer, not the raw extraction.

How to Convert Bank Statement PDFs to Excel: A Bookkeeper's Complete Guide

Martin — Sun, 31 May 2026 16:24:27 +0000

If you do bookkeeping for clients, you have encountered this scenario: a client sends you their bank statement as a PDF — sometimes a downloaded statement, sometimes a photo taken on their phone — and you need every transaction in Excel before you can start reconciling.

Copy-and-paste works for one page. For a 12-month statement with 400 transactions, it takes a morning. For a client who uses three different banks, it takes longer than the actual bookkeeping.

This guide covers every practical method for getting bank statement PDFs into Excel in 2026, from built-in Excel tools to AI converters, including the specific failure modes that trip up each approach.

Why bank statement PDFs are harder than other PDFs

Not all PDFs are equal. A PDF generated by accounting software embeds real text — copy-paste works fine. A bank statement is different for two reasons:

First, the layout is inconsistent. Every bank has a proprietary format: some put the date before the description, others after. Some banks include a running balance column; others don't. Reference numbers appear in different positions. A tool trained to parse one bank's format will fail silently on another bank's.

Second, scanned statements aren't text at all. If your client downloaded a statement that was originally generated as an image (common with older statements or smaller banks), or photographed a paper statement, the file contains no embedded text — just pixels. Standard PDF-to-Excel tools extract nothing useful from these.

The methods below are ordered by how well they handle both problems.

Method 1: Microsoft Excel's built-in PDF import

Excel for Microsoft 365 can import PDF tables directly via Data → Get Data → From File → From PDF.

How it works: Excel reads the PDF's embedded text and tries to identify table boundaries. For digitally-generated bank statements from major banks, this works about 60–70% of the time. The result lands in Power Query, where you can clean and load it.

When it fails:

Scanned PDFs (no embedded text) — Excel returns empty tables or garbage characters
Multi-column layouts where the date, description, and amount aren't aligned in a grid — common with older bank statement formats
Statements that span multiple pages with headers repeated on each page — you end up with duplicate header rows every 30 lines

Verdict: Good starting point if your client uses a major bank and the statement is a genuine digital PDF. Free, no additional software. Skip it if the statement is scanned or comes from a smaller institution.

Method 2: Adobe Acrobat (desktop or online)

Adobe Acrobat can export a PDF to Excel (.xlsx). The online version is free for occasional use; the desktop version requires an Acrobat subscription.

How it works: Acrobat uses its own table-detection engine, which is better than Excel's built-in import at handling multi-column formats. The result is usually cleaner than Excel's native import.

When it fails:

Scanned PDFs — same limitation as Excel. Acrobat's OCR (text recognition) is available in the paid desktop version, but results vary. A statement photographed at an angle or with uneven lighting will produce misaligned columns.
Complex formatting — footnotes, sidebar disclaimers, and multi-section layouts (checking + savings on the same statement) confuse the table detector and produce merged cells that require manual cleanup.

Verdict: Better than Excel's native import for clean digital PDFs. Still unreliable on scanned documents unless you have the full Acrobat desktop and the scan is high-quality.

Method 3: Tabula (free, open source)

Tabula is a free desktop application built specifically for extracting tables from PDFs. It's a favorite among data journalists and analysts.

How it works: You draw a selection rectangle around the table on each page, and Tabula extracts only that region. The output is a CSV.

Strengths:

Works well on digitally-generated PDFs with clean grid layouts
Free and runs locally — no data leaves your machine
The manual selection avoids header-confusion problems that plague automated tools

When it fails:

Scanned PDFs — Tabula extracts no text from image-based PDFs
Statements longer than 20 pages become tedious, since you draw a selection on each page (or trust the auto-detect, which is unreliable)
You need to install Java

Verdict: The right tool if you have a clean digital statement, value privacy (client data stays local), and don't mind spending 5–10 minutes per statement on manual page selection.

Method 4: Python (pdfplumber, Tabula-py, Camelot)

If you are comfortable with Python, the open-source ecosystem has solid PDF table extraction libraries.

pdfplumber — handles most digital PDFs well, good at detecting table boundaries automatically
Tabula-py — Python wrapper around the Tabula Java library, same strengths and limitations
Camelot — particularly good at "lattice" tables (those with visible cell borders), less reliable on "stream" tables without borders

All three require the PDF to have embedded text. None handles scanned documents.

Verdict: Excellent for bookkeepers who process high volumes and are comfortable scripting. Write once, reuse forever. Not practical for one-off statements.

Method 5: AI-powered converters (pdfexcel.ai and similar)

A newer category of tools uses AI to handle both the layout-variability problem and the scanned-document problem.

How they work: Instead of rule-based table detection, they use a trained model to identify what is a date, what is an amount, and what is a description — even when those aren't aligned in a neat grid. The better tools also run OCR on scanned and photographed PDFs before applying the structure model.

What to look for:

Does it handle scanned documents? This is the differentiator. If your client is emailing you a photo from their phone, you need OCR first. Not all tools in this category include it.
No templates required. Template-based tools (you specify which column contains the date) work for the bank you configured; they break on any other bank. AI tools should figure out the structure themselves.
Output quality. Run a test with a statement you already have in a clean format, and verify the transactions match exactly. Date formats, negative sign handling, and currency symbols are common failure points.
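Negative signs and currency symbols are easy to test for once amounts are normalized to numbers. The following is a rough sketch of such a normalizer; `parse_amount` is a hypothetical helper, and it assumes "." as the decimal mark and "," as the thousands separator.

```python
import re
from decimal import Decimal


def parse_amount(raw):
    """Parse a statement amount such as "$1,234.56", "(45.00)" or "45.00-".

    Assumes "." is the decimal mark and "," the thousands separator;
    statements using the opposite convention need a different rule.
    """
    text = raw.strip()
    negative = (
        text.startswith("-")
        or text.endswith("-")
        or (text.startswith("(") and text.endswith(")"))
    )
    digits = re.sub(r"[^0-9.]", "", text)  # drop symbols, commas, signs, brackets
    value = Decimal(digits)
    return -value if negative else value
```

Running both the converter's output and your known-good copy through the same normalizer lets you compare the two row by row.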

I use pdfexcel.ai for statements that come in as scanned documents or phone photos. The free tier covers 10 documents a month, which is enough for occasional use. For client-volume work, the Standard plan is $69/month and handles up to 1,000 documents.

Verdict: The right choice when the statement is a scan, a photo, or from an unusual bank layout. Also the fastest path for any statement — upload, wait ~20 seconds, download the xlsx. No Java, no Python, no manual page selection.

Choosing the right method for each scenario

Scenario	Recommended approach
Clean digital PDF from a major bank, one-off	Excel's built-in PDF import
Clean digital PDF, multiple pages, recurring	Tabula or pdfplumber
Scanned PDF or phone photo	AI converter (pdfexcel.ai)
High volume, tech-comfortable	Python (pdfplumber) + automation
Any format, no time to troubleshoot	AI converter

Common failure modes and how to fix them

"Columns are misaligned — dates merged with descriptions."
The tool treated the statement as a free-text document rather than a table. Try: (1) Tabula with manual selection rectangles, or (2) an AI converter that reads structure semantically.

"Every page has a header row in the middle of my data."
This is the repeated-header-on-each-page problem. Fix in Excel Power Query: filter out any row where the first column equals the column name (e.g., filter out rows where Date = "Date").

"The amounts are negative when they should be positive, or vice versa."
Some banks format credits as negative in the download (confusingly). Add a column in Excel that multiplies by −1, or reclassify after import.

"The OCR got most of it right but a few rows have garbage characters."
This happens with low-quality scans. Check: faded ink, angled photos, or a page that wasn't flat when scanned. Re-photograph those pages flat in good light, then re-run.

"The tool returned the right data but the date format is DD/MM/YYYY and I need YYYY-MM-DD."
Format the column in Excel (Ctrl+1 → Number → Date → choose format), or use Power Query's "Change Type → Using Locale" to specify the source locale.
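If you would rather script these fixes than repeat them in Excel, three of them (repeated header rows, inverted signs, DD/MM/YYYY dates) fit in one pass. A minimal sketch, assuming rows of the form [date, description, amount] with the header first; `clean_rows` and `flip_sign` are hypothetical names.

```python
from datetime import datetime
from decimal import Decimal


def clean_rows(rows, flip_sign=False):
    """Tidy extracted statement rows of the form [date, description, amount].

    rows[0] is the header. Assumes dates arrive as DD/MM/YYYY text and
    amounts as plain decimal strings.
    """
    header, cleaned = rows[0], []
    for row in rows[1:]:
        if row == header:
            continue  # repeated page header in the middle of the data
        date, description, amount = row
        iso_date = datetime.strptime(date, "%d/%m/%Y").strftime("%Y-%m-%d")
        value = Decimal(amount)
        if flip_sign:
            value = -value  # the bank exported credits as negatives
        cleaned.append([iso_date, description, value])
    return [header] + cleaned
```

Parsing with an explicit format string fails loudly on a date that is not DD/MM/YYYY, which is safer than letting a tool guess between day-first and month-first.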

Workflow for a client who sends a phone photo

Ask the client to photograph each page flat on a desk, in good lighting, with the statement filling the frame. Quality in = quality out.
Upload to pdfexcel.ai. If multiple pages, combine into a single PDF first (any free PDF merger works).
Download the xlsx. Open in Excel.
Spot-check the first and last 10 rows against the original image. Verify totals.
If any rows are garbled, note the page, request a clean scan of that page, re-upload.

The whole workflow for a 3-page statement takes under 5 minutes once you have the photos.

Bottom line

For digital PDFs from major banks: Excel's built-in import or Tabula. Fast, free, reliable.

For anything scanned, photographed, or from a bank with an unusual layout: use an AI converter. The time you spend troubleshooting column alignment in Tabula for a scanned document will cost more than a month of any converter's subscription.

The biggest mistake I see bookkeepers make is spending 45 minutes wrestling with a tool that was never designed for their type of document. Match the tool to the statement type first, and the rest is fast.

Author note: I use pdfexcel.ai when client statements arrive as phone photos or from smaller banks where template-based tools fail. The free tier covers my occasional needs; client-volume work uses their Standard plan.

How to Convert Invoice PDFs to Excel: A Practical Guide for Accounts Payable Teams

Martin — Mon, 25 May 2026 16:41:55 +0000

Every accounts payable team has the same recurring problem: a pile of vendor invoices in PDF format, and a spreadsheet that needs updating before the next payment run.

Some of those PDFs are clean — generated directly from accounting software, with selectable text and tidy tables. Many are not: scanned paper invoices, photographed receipts, or vendor PDFs with non-standard layouts that break every generic converter you've tried. This guide covers the full spectrum, from the simple methods to the ones that actually work when your vendor faxes you a JPG disguised as a PDF.

What You're Actually Trying to Extract

Before choosing a method, it helps to be precise about what data you need from an invoice:

Header fields: Vendor name, invoice number, invoice date, due date, PO number
Line items: Description, quantity, unit price, line total
Totals: Subtotal, tax, discounts, amount due
Remittance details: Vendor bank account or payment address

Not all methods extract all of these. A tool that pulls the line-item table perfectly might drop the invoice date if it's in the header above the table. Know which fields you need before committing to a workflow.
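Writing the target fields down as a data structure makes it obvious when a tool drops one, and lets you check the arithmetic automatically. The sketch below covers a subset of the fields listed above (no due date, PO number, discounts, or remittance details); the class and method names are illustrative.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal


@dataclass
class Invoice:
    vendor_name: str
    invoice_number: str
    invoice_date: str
    subtotal: Decimal
    tax: Decimal
    amount_due: Decimal
    line_items: list = field(default_factory=list)

    def problems(self):
        """List arithmetic inconsistencies that suggest a misread field."""
        found = []
        for item in self.line_items:
            if item.quantity * item.unit_price != item.line_total:
                found.append(f"line total mismatch: {item.description}")
        if sum((i.line_total for i in self.line_items), Decimal("0")) != self.subtotal:
            found.append("line items do not sum to subtotal")
        if self.subtotal + self.tax != self.amount_due:
            found.append("subtotal + tax != amount due")
        return found
```

An empty list from `problems()` does not prove the extraction is right, but a non-empty one is a cheap signal that a number was misread.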

Method 1: Excel's Built-In PDF Importer

For a clean, text-layer PDF from a well-formatted vendor, Excel's native import is the fastest path:

Open Excel → Data → Get Data → From File → From PDF
Select the invoice PDF
Excel detects tables and page elements using Power Query
Preview the detected tables and load the one that contains your line items

What it does well: Fast, free, no external dependencies. Works reliably on PDFs generated by QuickBooks, Xero, FreshBooks, SAP — any system that outputs clean, structured PDF tables.

Where it fails:

Scanned or photographed invoices (returns nothing — no text layer to read)
Invoices where the line-item grid spans headers in merged cells (Power Query often fractures these)
Multi-page invoices where the table continues across pages (each page is treated independently)
Vendors with creative PDF layouts: some use free-floating positioned text boxes rather than a real table structure, and Power Query misses them entirely

For a one-off clean digital invoice, start here. For anything else, keep reading.

Method 2: Copy-Paste With Text Editing

Sometimes the simplest tool is fastest. If the PDF has a text layer, you can select all, paste into Excel or a text editor, and clean it up. This works surprisingly well for invoices with simple layouts — vendor name, one or two line items, a total.

The breakdown: non-standard column spacing means pasted text lands in a single column, and separating it into the right cells requires manual work. At 5 invoices a week, this is acceptable. At 50, it is not.

Method 3: Python with pdfplumber or Camelot

For developers or technically-comfortable analysts who process large volumes of the same invoice format, Python delivers the most control:

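import pdfplumber
import pandas as pd

# Works only on PDFs with a text layer; scanned pages yield no tables.
with pdfplumber.open("vendor-invoice.pdf") as pdf:
    page = pdf.pages[0]  # first page only
    tables = page.extract_tables()  # list of tables, each a list of rows
    if tables:
        # First row of the first table is the header; the rest are line items.
        df = pd.DataFrame(tables[0][1:], columns=tables[0][0])
        df.to_excel("invoice_lines.xlsx", index=False)  # requires openpyxl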

For lattice-style tables (visible border lines), camelot handles the extraction more reliably:

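import camelot

# flavor="lattice" relies on ruled cell borders to find the table grid.
tables = camelot.read_pdf("vendor-invoice.pdf", flavor="lattice")
if len(tables) > 0:  # guard: a scanned or borderless page yields no tables
    tables[0].df.to_excel("invoice_lines.xlsx", index=False)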

When Python is the right call: You receive 200+ invoices monthly from the same three vendors. You write the extraction logic once — tuning to their specific layouts — and then it runs automatically. The upfront cost is real (1-4 hours per vendor template), but at scale it pays off.

When it breaks down:

Scanned invoices (need OCR — adding pytesseract or easyocr raises setup complexity significantly)
New or irregular vendor formats (each new format requires a new parsing script)
Mixed batches with 20 different vendors (template proliferation becomes its own management problem)
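The per-vendor approach usually ends up as a registry of small parsers, which also shows why it stops scaling: every new layout means another entry. A sketch with two made-up vendors; `parse_acme`, `parse_globex`, and their column orders are hypothetical, and `table` is a list of rows with the header first.

```python
def parse_acme(table):
    """Hypothetical layout: description, quantity, unit price, total."""
    return [{"description": r[0], "total": r[3]} for r in table[1:]]


def parse_globex(table):
    """Hypothetical layout: total first, then description."""
    return [{"description": r[1], "total": r[0]} for r in table[1:]]


VENDOR_PARSERS = {"acme": parse_acme, "globex": parse_globex}


def parse_invoice_table(vendor, table):
    """Dispatch to the vendor-specific parser; unknown vendors fail loudly."""
    try:
        parser = VENDOR_PARSERS[vendor.lower()]
    except KeyError:
        raise ValueError(f"no parser registered for vendor {vendor!r}") from None
    return parser(table)
```

Failing on an unregistered vendor is deliberate: falling back to a default parser would put amounts in the wrong columns without any warning.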

Method 4: AI PDF-to-Excel Converters

For AP teams that deal with a mix of vendors, scanned documents, and irregular formats — which describes most real-world invoice processing — general AI converters offer the best balance of accuracy and flexibility.

The critical distinction in this category is OCR quality. A traditional converter reads the PDF's text layer. An AI-powered converter with genuine OCR reads the image, reconstructs the layout, and maps text to rows and columns — which is the only approach that works on scanned invoices.

Tools like PDFExcel are built specifically for this: they handle photographed documents, scanned PDFs, and multi-vendor formats without requiring you to configure a template for each vendor. You upload the invoice, and the output is a structured spreadsheet — vendor name in its own cell, line items in rows, totals separated from the item grid.

When evaluating any AI converter for invoice work, test it with these three cases:

A clean digital invoice from a major accounting platform (easy — nearly every tool passes this)
A photographed invoice from a small vendor (medium — tests OCR accuracy)
A multi-page invoice with a line-item table that spans pages 1-3 (hard — tests whether the tool reassembles the table correctly)

The third test is the one that exposes tools that demo well but fail in production.

Method 5: Dedicated Invoice Processing Platforms

For large AP operations with structured approval workflows, dedicated platforms may justify the cost:

Nanonets — AI-based invoice extraction with GL-coding and approval routing; integrates with NetSuite, SAP, QuickBooks
Klippa — strong on receipt and invoice OCR; API-first design suits developers building AP pipelines
Docsumo — neural-network extraction tuned to specific invoice types including tax forms

These tools are built for the enterprise AP workflow — they capture the data, route it for approval, and push it to your ERP. If you need that entire pipeline, they're worth evaluating. If you just need the data in a spreadsheet, the per-document cost and setup overhead often exceed the value.

Handling the Hard Cases

Scanned invoices from international vendors

Scanned invoices introduce two problems: OCR accuracy on non-English characters, and document skew (the paper was placed on the scanner at an angle). Good AI converters handle both. If you're receiving a large volume of scanned invoices from specific countries, test a representative sample — French punctuation, German umlauts, and Japanese invoice formats all produce different OCR failure modes.

Invoices with totals in the body copy, not a table

Some vendors — especially smaller ones and sole traders — send PDFs that are essentially formatted emails: paragraphs of text with the total buried in a sentence like "Total due: $1,450.00." Table-extraction tools will miss this. AI converters with natural language understanding can pull it; simpler tools cannot.
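For the simplest version of this case, a regular expression can pull a labelled total out of body text. The sketch below is only an illustration: the label list and the `find_total` helper are assumptions, not part of any tool mentioned here, and real invoices will need more patterns than this.

```python
import re
from decimal import Decimal

# Hypothetical label list; extend it for the phrasing your vendors use.
TOTAL_PATTERN = re.compile(
    r"(?:total due|amount due|balance due|total)\s*:?\s*"
    r"(?P<symbol>[$€£])?\s*"
    r"(?P<amount>\d{1,3}(?:,\d{3})+(?:\.\d{2})?|\d+(?:\.\d{2})?)",
    re.IGNORECASE,
)


def find_total(text):
    """Return the first labelled total found in free text as a Decimal, or None."""
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting.
    return Decimal(match.group("amount").replace(",", ""))
```

A pattern like this only covers invoices that label the total explicitly. Anything phrased more loosely still needs a tool that understands the sentence.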

Multi-currency invoices

If you receive invoices in USD, EUR, and GBP in the same batch, the conversion step is outside what any PDF extractor does — that's a post-extraction calculation. Flag currency in a dedicated column (most good extractors include it) so you can apply exchange rates downstream.
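The downstream step might look something like the following. It assumes each extracted row is a dict with `currency` and `total` fields (hypothetical names), and the rates shown are placeholders rather than real exchange rates.

```python
from decimal import Decimal, ROUND_HALF_UP

# Placeholder rates for illustration only; load real rates from your finance system.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10"), "GBP": Decimal("1.25")}


def to_base_currency(rows, rates=RATES_TO_USD):
    """Return copies of the rows with a 'total_usd' field derived from 'currency'."""
    converted = []
    for row in rows:
        rate = rates.get(row["currency"])
        if rate is None:
            # Fail loudly rather than silently treating an unknown currency as USD.
            raise ValueError(f"No exchange rate for currency {row['currency']!r}")
        total_usd = (Decimal(row["total"]) * rate).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP
        )
        converted.append({**row, "total_usd": total_usd})
    return converted
```

Keeping the original amount and currency alongside the converted value makes it possible to re-run the conversion later with a different rate.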

Building a Repeatable AP Invoice Workflow

Once you have a reliable extraction step, the full workflow looks like this:

Collect: vendor portal, email, or physical scan → single-format PDF
Extract: AI converter → raw spreadsheet (vendor, invoice #, date, due date, line items, total)
Validate: three-way match — PO amount, received goods quantity, invoice amount. Flag mismatches.
Code: assign GL codes, cost centers, department
Approve: route to the right approver based on amount and category
Import: push to your AP system (QuickBooks, Xero, NetSuite) using their CSV import format
Archive: store original PDF + extracted spreadsheet together, keyed by invoice number
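
The Validate step in the list above can be sketched roughly as follows. The dict layout and field names (`quantity`, `amount`) are assumptions for illustration; a real AP system will have its own record shapes and tolerance rules.

```python
from decimal import Decimal


def three_way_match(po, receipt, invoice, tolerance=Decimal("0.01")):
    """Return a list of mismatch descriptions; an empty list means the invoice passes."""
    issues = []
    if receipt["quantity"] != po["quantity"]:
        issues.append(f"received {receipt['quantity']} but PO ordered {po['quantity']}")
    if invoice["quantity"] != receipt["quantity"]:
        issues.append(f"invoiced {invoice['quantity']} but received {receipt['quantity']}")
    if abs(Decimal(invoice["amount"]) - Decimal(po["amount"])) > tolerance:
        issues.append(
            f"invoice amount {invoice['amount']} differs from PO amount {po['amount']}"
        )
    return issues
```

Returning descriptions rather than a bare true/false gives the reviewer something to act on when an invoice is flagged.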

Step 2 is where most manual time is lost. Automating it — even at 90% accuracy with a human review step for exceptions — cuts processing time substantially.

Choosing the Right Method

Situation	Best approach
Single clean digital invoice, one-time task	Excel Power Query
High-volume batches from 2-3 known vendors, same format	Python (pdfplumber or Camelot)
Mixed vendors, any scanned or photographed invoices	AI PDF converter
Enterprise AP with approval routing and ERP integration	Nanonets, Klippa, or similar

Most mid-size accounting teams land in the third row: too many vendor formats for Python templates, too many scanned documents for Excel's built-in importer. The AI converter handles the extraction; your AP team handles the validation and coding.

Common Mistakes

Assuming all vendor PDFs are text-layer PDFs. A file ending in .pdf can be a pure image with no extractable text at all. If your converter returns empty cells, open the PDF in Adobe Reader and try to select text. If you can't, the document is image-only and needs OCR.
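
The same check can be automated before a batch runs. This is a minimal sketch, assuming pdfplumber is installed; the helper names and the `min_chars` threshold are illustrative choices, not a library feature.

```python
def has_text_layer(page_texts, min_chars=20):
    """True if the extracted page texts contain enough characters to suggest real text."""
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


def pdf_has_text_layer(path, min_chars=20):
    """Open a PDF with pdfplumber and check whether any text can be extracted."""
    import pdfplumber  # imported lazily so has_text_layer works without it

    with pdfplumber.open(path) as pdf:
        return has_text_layer([page.extract_text() for page in pdf.pages], min_chars)
```

Files that fail the check can be routed to an OCR step instead of producing empty rows further down the pipeline.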

Using a single total to validate extraction. Always check that the sum of extracted line items matches the invoice total. Extraction errors often appear in individual line items, not the footer total (which is sometimes hardcoded as static text rather than a calculated cell).
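
A sum check of this kind is short to write. The sketch below assumes `invoice_total` is directly comparable to the line items, meaning tax and shipping are either included as their own lines or excluded from both sides; the function name is hypothetical.

```python
from decimal import Decimal


def line_items_match_total(line_amounts, invoice_total, tolerance=Decimal("0.01")):
    """Return (ok, difference) comparing the summed line items to the stated total."""
    extracted_sum = sum((Decimal(a) for a in line_amounts), Decimal("0"))
    difference = extracted_sum - Decimal(invoice_total)
    return abs(difference) <= tolerance, difference
```

The returned difference is often a useful clue: a gap equal to one line's amount usually points at a dropped or duplicated row.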

Not standardizing the output format. Every vendor uses different column names and date formats. Before importing to your AP system, run a normalization step: consistent date format (YYYY-MM-DD), consistent currency format (no commas, two decimal places), consistent column headers. A lookup table mapping vendor-specific column names to your standard schema saves hours at import time.
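
One way to sketch that normalization step is shown below. The column mapping and the list of accepted date formats are assumptions for illustration; build them from the headers and formats your own vendors actually use.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical lookup table: vendor-specific header -> standard schema name.
COLUMN_MAP = {
    "inv no": "invoice_number",
    "invoice #": "invoice_number",
    "inv date": "invoice_date",
    "date": "invoice_date",
    "amt": "total",
    "amount due": "total",
}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")  # assumed input formats


def normalize_date(value):
    """Convert a date in any accepted format to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


def normalize_amount(value):
    """Strip currency symbols and commas; return a string with two decimal places."""
    cleaned = value.replace(",", "").replace("$", "").strip()
    return f"{Decimal(cleaned):.2f}"


def normalize_row(row):
    """Rename headers via COLUMN_MAP and normalize the date and total fields."""
    out = {}
    for header, value in row.items():
        key = COLUMN_MAP.get(header.strip().lower(), header.strip().lower())
        if key == "invoice_date":
            value = normalize_date(value)
        elif key == "total":
            value = normalize_amount(value)
        out[key] = value
    return out
```

Raising on an unrecognized date, rather than passing it through, keeps a new vendor format from slipping into the import file unnoticed.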

The Bottom Line

For a single clean PDF, Excel's built-in importer is fast and free. For large volumes of the same format, Python pays off after the upfront template cost. For everything else — mixed vendors, scanned documents, one-offs from clients — an AI converter is the practical choice, and the cost (typically the price of an hour of staff time per month) is covered by the time saved on the first batch.

I used PDFExcel to test against a photographed invoice from a contractor and a multi-page vendor statement; both came back as clean spreadsheets without requiring template setup. Your results will depend on document quality, so test with a representative sample from your actual vendor mix before committing.

Have a specific invoice format that's breaking your extraction workflow? Drop it in the comments — the edge cases are often more instructive than the clean examples.

Tabula vs Camelot vs pdfplumber in 2026: Which Python Library Actually Wins?

Martin — Sun, 24 May 2026 16:19:40 +0000

When you need to extract tables from PDFs in Python, three libraries dominate every Stack Overflow answer and tutorial from the past few years: Tabula, Camelot, and pdfplumber. Each has real strengths and genuine failure modes — and the advice you got in 2022 may steer you wrong today.

This guide covers what each library does well in 2026, where each breaks, and how to choose the right one for your specific document type. At the end, I'll flag when it makes more sense to skip the code entirely.

The quick comparison table

Library	Best for	Fails on
Tabula	Stream tables in native PDFs	Lattice grids, scanned PDFs
Camelot	Lattice tables in native PDFs	Scanned PDFs, complex layouts
pdfplumber	Complex layouts, debugging	Scanned PDFs
None of the above	Scanned / photographed PDFs	← use an OCR-first tool

Tabula

Tabula's extraction engine is a Java library (tabula-java); tabula-py wraps it for Python. It detects table boundaries by analyzing whitespace and text positioning in text-layer PDFs. It works in two modes:

Stream: uses column whitespace to identify boundaries
Lattice: uses drawn lines/borders to identify boundaries

Setup is minimal:

import tabula

# Extract all tables from a PDF
tables = tabula.read_pdf("bank_statement.pdf", pages="all")
for df in tables:
    print(df.head())

When it works well: Clean, text-based PDFs with consistent column spacing — simple bank statement exports, government reports, or any document using whitespace rather than cell borders to separate data.

When it fails:

PDFs with multi-column layouts that confuse the stream parser
Tables that span multiple pages with repeated headers (you often get duplicate header rows)
Any scanned or image-based PDF — Tabula reads the text layer, which doesn't exist in scanned documents
Dense bordered grids (Camelot's lattice mode handles those better)
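
The duplicate-header problem mentioned in the list above is usually easy to clean up after extraction. This sketch assumes the per-page tables have already been flattened into one list of rows with the header first; the helper name is hypothetical.

```python
def drop_repeated_headers(rows):
    """Keep the first row as the header and drop later rows identical to it."""
    if not rows:
        return []

    def key(row):
        # Compare cells case-insensitively and ignore surrounding whitespace.
        return [str(cell or "").strip().lower() for cell in row]

    header = key(rows[0])
    return [rows[0]] + [row for row in rows[1:] if key(row) != header]
```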

2026 maintenance status: tabula-py is community-maintained. The underlying Tabula Java library has been largely stable since 2018: there is not much active development, but it still works reliably for its core use case.

Camelot

Camelot takes a more principled approach to table detection. Its lattice mode uses line-detection algorithms to find explicit table borders; its stream mode analyzes whitespace similar to Tabula. The critical difference: Camelot's lattice mode is noticeably more accurate on documents where cells have drawn borders.

import camelot

# Lattice mode — best for tables with visible borders
tables = camelot.read_pdf("invoice.pdf", flavor="lattice")
print(tables[0].df)

# Stream mode — best for whitespace-separated tables
tables = camelot.read_pdf("statement.pdf", flavor="stream")

Camelot also lets you visualize exactly what it detected, which cuts debugging time dramatically:

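# Requires matplotlib; kind="grid" draws the detected table cells over the page
camelot.plot(tables[0], kind="grid").show()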

When it works well: Invoices and formal reports with explicit cell borders. Financial statements exported from accounting software that preserve table structure cleanly. Any document where you would visually describe the layout as "a grid with lines."

When it fails:

Irregular tables where cells span multiple rows or columns
PDFs generated from scans (same hard limit as Tabula — no text layer, no extraction)
Some PDFs return "No tables found" even when tables are clearly visible on screen; this usually means the PDF uses positioned text rather than actual line objects

2026 maintenance status: Maintenance is intermittent. Active development lives in the camelot-dev/camelot repository; atlanhq/camelot is the project's earlier home and is not where current releases come from. For new projects, install camelot-py from the camelot-dev line.

pdfplumber

pdfplumber operates at a lower level than Tabula or Camelot. Instead of asking "find me the tables," you get precise access to every character, line segment, and rectangle in the PDF's geometry. You direct the extraction; it executes exactly what you specify.

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        if table:
            for row in table:
                print(row)

        # Or extract all words with their coordinates
        words = page.extract_words()

pdfplumber's visual debugger is the standout feature — it shows exactly what the library detected, which turns a 45-minute head-scratching session into a 5-minute fix:

with pdfplumber.open("messy_invoice.pdf") as pdf:
    page = pdf.pages[0]
    im = page.to_image()
    im.debug_tablefinder()
    im.save("debug.png")

You can also tune the table detection settings directly — column tolerance, edge detection, snap tolerance — which matters when documents have inconsistent column spacing or overlapping elements.
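
As a rough starting point for that tuning, the settings below target a whitespace-separated table with no drawn borders. The keys are pdfplumber table settings; the values and the `extract_rows` helper are assumptions to adjust against your own documents.

```python
# Starting guesses for a borderless, whitespace-separated table.
WHITESPACE_TABLE_SETTINGS = {
    "vertical_strategy": "text",    # infer column edges from text alignment
    "horizontal_strategy": "text",  # infer row edges from text alignment
    "snap_tolerance": 3,            # merge nearly-aligned edges
    "join_tolerance": 3,            # join edge segments that almost touch
    "intersection_tolerance": 3,    # how close edges must be to count as crossing
    "min_words_vertical": 3,        # words needed to imply a column edge
}


def extract_rows(path, table_settings=WHITESPACE_TABLE_SETTINGS):
    """Collect table rows from every page using the given pdfplumber settings."""
    import pdfplumber  # imported lazily; requires pdfplumber to be installed

    rows = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(table_settings)
            if table:
                rows.extend(table)
    return rows
```

For bordered tables the usual starting point is the opposite: leave both strategies at "lines" and adjust only the tolerances.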

When it works well: PDFs with irregular or overlapping table structures. Invoices where column boundaries shift row-to-row. Situations where you need precise control over what gets extracted and how. Also excellent for extracting specific regions of a page rather than entire tables.

When it fails:

Slower than Tabula and Camelot on large documents (the extra precision costs time)
Requires more code for complex cases — you'll be adjusting table_settings parameters rather than just calling read_pdf()
Still cannot handle scanned PDFs

2026 maintenance status: Actively maintained with regular releases. Responsive to issues. The best choice for long-term projects where maintenance risk matters.

The constraint all three share

None of these libraries can read scanned PDFs, photographed documents, or files that are just images wrapped in a PDF container. They all parse the PDF's text layer — the underlying character objects that a properly exported PDF contains.

If your document was printed and scanned, or photographed on a phone, the text layer is either absent or contains garbage. All three libraries will return empty results or extract nonsense.

For scanned documents you need an OCR preprocessing step:

# Option: pdf2image + pytesseract
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("scanned_statement.pdf", dpi=300)
for page_img in pages:
    text = pytesseract.image_to_string(page_img)
    # then parse the text...

This works but adds significant complexity — you're now managing image resolution, OCR accuracy, and text parsing in addition to the extraction logic itself.
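
To give a sense of the text-parsing part, here is a minimal sketch that turns OCR output lines into transaction rows. The row layout it expects (MM/DD date, description, amount at the end of the line) is an assumption; every bank format needs its own pattern.

```python
import re
from decimal import Decimal

# Assumed row layout: MM/DD date, free-text description, amount at end of line.
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2})\s+(?P<description>.+?)\s+(?P<amount>-?[\d,]+\.\d{2})$"
)


def parse_transactions(ocr_text):
    """Return one dict per line that looks like a transaction; skip everything else."""
    rows = []
    for line in ocr_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(
                {
                    "date": match.group("date"),
                    "description": match.group("description"),
                    "amount": Decimal(match.group("amount").replace(",", "")),
                }
            )
    return rows
```

Lines that do not match are dropped silently here, so in practice it is worth logging them: an OCR misread such as a letter O in place of a zero will otherwise cost you a row without warning.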

Side-by-side test: Chase bank statement (digital export)

To make the comparison concrete, I tested all three on a typical digital PDF bank statement (5 pages, 250 transaction rows, whitespace-separated columns with no explicit borders):

Library	Rows extracted	Issues
Tabula (stream)	247/250	3 rows with long descriptions merged with next row
Camelot (lattice)	0/250	No borders detected — wrong mode for this document
Camelot (stream)	238/250	12 rows with descriptions over ~60 chars dropped
pdfplumber (default)	241/250	9 rows missed due to column tolerance
pdfplumber (tuned)	250/250	Required ~20 min of table_settings adjustment

Takeaway: pdfplumber gives the best accuracy but requires effort to tune. Camelot lattice is useless for a document without borders — always check your document type before picking the mode. Tabula stream gives solid results with zero configuration.

How to choose

Use Tabula when: You have clean text-layer PDFs with whitespace-separated columns and want the fastest setup. Government reports, simple bank exports, standard invoices.

Use Camelot (lattice) when: Your PDFs have explicit cell borders and you need higher accuracy than Tabula delivers. Formal financial statements, printed reports, tables with visible grid lines.

Use pdfplumber when: Your table structure is irregular, you need to debug extraction failures, or you're building a long-term pipeline where you need fine control over detection parameters. The visual debugger alone is worth the learning curve.

Use OCR preprocessing when: Any of your source documents are scanned images. All three libraries will fail silently or return empty results on image-only PDFs.

When to skip the code entirely

If you're building a recurring pipeline that processes hundreds or thousands of PDFs regularly, the libraries above are the right tool. But a meaningful portion of real-world PDF extraction work doesn't fit that profile.

For a bookkeeper processing monthly bank statements, a CPA handling 1099s across tax season, or an analyst who needs to pull tables from 20 PDFs once, setting up Python with Java dependencies (Tabula requires Java 8+), working through installation issues, and maintaining version compatibility is disproportionate effort.

Tools like PDFExcel handle scanned PDFs, photographed documents, and varied layouts without code — upload the file, download a clean spreadsheet. They're particularly useful when documents vary in type (some scanned, some digital, some photographed) or when the person doing the work isn't a developer.

The honest decision rule: if you're already comfortable in Python and will process PDFs regularly, pick from the libraries above. If you need occasional one-off extraction, or you need scanned-document support without building and maintaining an OCR pipeline, a dedicated tool saves real time.

Final verdict (2026)

	Tabula	Camelot	pdfplumber
Bordered tables	OK	Best	Good
Whitespace tables	Best	Good	Good
Scanned PDFs	No	No	No
Visual debugging	No	Basic	Excellent
Custom settings	Limited	Limited	Extensive
Maintenance (2026)	Low	Medium	Active
Setup complexity	Low	Medium	Low

For new projects in 2026: pdfplumber is the safest default — actively maintained, handles the widest range of layouts, and the debugger makes troubleshooting fast. Use Camelot when you have explicitly bordered tables and need the best lattice accuracy. Use Tabula when you need a quick solution for standard text-layer documents and don't want to tune parameters.

All three fail on scanned PDFs. Either preprocess with OCR or use a tool built for it.

How to Convert Bank Statement PDFs to Excel: The Complete 2026 Guide

Martin — Sat, 23 May 2026 16:19:06 +0000

If you work in accounting or bookkeeping, you have probably spent hours copying transaction data from PDF bank statements into Excel. It is tedious, error-prone, and completely unnecessary in 2026. This guide walks through every method — from manual copy-paste to fully automated AI extraction — so you can pick what actually works for your volume and document types.

Why Bank Statement PDFs Are Harder Than They Look

PDFs sound simple — they are just documents, right? The problem is that most bank statement PDFs are one of three types:

Native PDFs — the bank generated them from structured data, so the text is selectable. In theory, you can copy-paste columns. In practice, the table formatting almost never survives the paste into Excel — you end up with one column of merged text.
Scanned PDFs — paper statements that were photographed or scanned to PDF. There is no selectable text at all. Excel's built-in "Data from PDF" feature simply fails here.
Image PDFs — digitally generated but rendered as images, not text layers. Same problem as scanned.

Banks also love to vary their formats: some use wide three-column layouts, some embed check images on the same page, some include multi-currency sections, and some rotate the page for landscape statements. No single template handles all of them.

Method 1: Excel's Built-In "Data from PDF"

For clean, native PDFs from modern banks, Excel can sometimes handle this directly:

Open Excel → Data tab → Get Data → From File → From PDF
Select your statement, choose the table from the preview navigator
Click Load

When this works: Simple, modern bank statements from major US banks (Chase, Bank of America, Wells Fargo) with clean single-table layouts and no embedded images.

When this fails: Any scanned document, any multi-section statement, any bank that generates image-based PDFs, and any statement with check images on the same page as transactions.

The real-world failure rate is high — probably 60–70% of actual accounting workloads involve documents that will not survive this method cleanly.

Method 2: Python Libraries (For Developers)

If you are comfortable with Python, several libraries can extract tables from native PDFs:

tabula-py works well on PDFs with clearly bounded table cells:

import tabula

# multiple_tables=True returns one DataFrame per detected table
dfs = tabula.read_pdf("statement.pdf", pages="all", multiple_tables=True)
for i, df in enumerate(dfs):
    df.to_csv(f"transactions_{i}.csv", index=False)

camelot handles more complex table structures and provides accuracy scores:

import camelot
tables = camelot.read_pdf("statement.pdf", pages="1-end", flavor="lattice")
tables[0].df.to_csv("transactions.csv")
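
The accuracy score is exposed through each table's `parsing_report` dictionary, so low-confidence tables can be set aside for review. A small sketch; the 90.0 threshold and the helper name are arbitrary assumptions.

```python
def low_accuracy_tables(tables, threshold=90.0):
    """Return the tables whose Camelot parsing_report accuracy is below the threshold."""
    return [t for t in tables if t.parsing_report["accuracy"] < threshold]
```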

pdfplumber gives the most control for customizing extraction regions:

import pdfplumber
with pdfplumber.open("statement.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        if table:
            print(table)

The critical limitation of all three: None of them work on scanned PDFs at all. They extract text only from PDFs where text is embedded — which excludes every paper statement that was scanned. For scanned documents, you would need to layer in an OCR engine (Tesseract or a cloud OCR API), preprocess the image for contrast and deskew, then parse the OCR output. That is a multi-hundred-line project for each bank format you encounter.

Method 3: AI-Based Extraction Tools

For most accounting and bookkeeping workloads, AI tools that handle both native and scanned PDFs are the fastest path. The key differences from traditional converters:

Template-free: The AI reads document structure the way a person would — no per-bank configuration.
Scanned document support: Handles photographed statements, tilted pages, and mobile phone photos.
Multi-bank formats out of the box: Works on international banks and unusual layouts without setup.

PDFExcel is built specifically for this workflow. You upload the bank statement PDF — whether it is a clean digital export or a photographed mobile scan — and get back a clean Excel file with transactions organized in labeled columns. It handles the common problem cases: statements with embedded check images, landscape-rotated pages, and multi-section statements with beginning/ending balance summaries.

Typical workflow:

Upload the PDF (or a folder of PDFs for batch processing)
Review the output — column headers are auto-detected from the statement
Download the Excel file or open it directly in Google Sheets

There is a free tier (10 documents/month, no credit card required) that works for occasional use, and paid plans for firms processing statements at volume.

Method 4: Specialist Bank Statement Converters

Several tools are built specifically for financial document extraction: DocuClipper, Parsio, bankstatementconverter.com, and financefileconverter.com all target this use case. They typically perform very well on major US bank formats they have been specifically trained on.

The tradeoff: specialist tools can be more accurate on familiar formats but less flexible on edge cases. A general-purpose AI document tool handles unusual formats (international banks, rotated pages, mobile photos) better because it is not locked to a template library.

Choosing the Right Method

Situation	Best method
Clean native PDF, one-off task	Excel's built-in "Data from PDF"
Large batch, technically inclined, native PDFs only	Python: tabula-py or camelot
Mix of scanned + digital statements	AI tool (PDFExcel, DocuClipper)
Mostly US major banks, high volume	Specialist bank statement converter
International banks / mobile phone photos	General-purpose AI tool with OCR

Common Pitfalls to Avoid

Do not rely on a glance at the running balance to catch extraction errors. If the tool drops a transaction row, it drops that row's balance along with it, so the remaining balance column still looks plausible. Always verify the transaction count against the statement's printed count.
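
Where the statement prints opening and closing balances, an arithmetic check complements the count. This sketch assumes amounts are signed (debits negative) and that both balances are read from the statement summary rather than from the extracted rows; the function name is hypothetical.

```python
from decimal import Decimal


def reconcile(opening_balance, closing_balance, amounts):
    """Return printed closing balance minus the expected one (zero if consistent)."""
    expected = Decimal(opening_balance) + sum(
        (Decimal(a) for a in amounts), Decimal("0")
    )
    return Decimal(closing_balance) - expected
```

A non-zero result equal to a single transaction amount is a strong hint about which row went missing.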

Watch for negative number formatting. Banks represent debits in multiple ways: parentheses (1,234.00), a negative sign −1,234.00, a red font (invisible in plain-text extraction), or a separate "debit" column. Verify that your extraction method preserves these correctly before importing into your accounting software.
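
The text-based variants can be normalized with a small parser along these lines. It is a sketch under the assumption that amounts arrive as strings; the `is_debit_column` flag is a hypothetical way to handle a separate debit column, and red-font debits cannot be recovered from plain text at all.

```python
from decimal import Decimal


def parse_amount(raw, is_debit_column=False):
    """Parse '(1,234.00)', '-1,234.00', '1,234.00-' or a plain amount into a signed Decimal."""
    # Normalize the Unicode minus sign, then drop currency symbol and separators.
    text = raw.strip().replace("\u2212", "-").replace("$", "").replace(",", "")
    negative = is_debit_column
    if text.startswith("(") and text.endswith(")"):
        negative = True
        text = text[1:-1]
    if text.startswith("-") or text.endswith("-"):
        negative = True
        text = text.strip("-")
    value = Decimal(text)
    return -value if negative else value
```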

Check the date format. US banks use MM/DD/YYYY; many international banks use DD/MM/YYYY. An AI tool should handle this automatically, but always spot-check the first few transaction dates.
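
A spot-check can be partly automated: any date whose first field exceeds 12 must be day-first, and any whose second field exceeds 12 must be month-first. The helper below is an illustrative sketch that assumes slash-separated dates.

```python
def infer_day_first(date_strings):
    """Return True for DD/MM, False for MM/DD, None if the sample is ambiguous."""
    for value in date_strings:
        first, second = (int(part) for part in value.split("/")[:2])
        if first > 12:
            return True
        if second > 12:
            return False
    return None
```

A result of None means every date in the sample has both fields at 12 or below, so the format has to be confirmed from the bank's country or the statement period instead.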

Batch carefully if the statement spans multiple accounts. Some PDF exports from online banking include multiple account statements in a single file. Pre-split these before processing, or use a tool that can detect account-section boundaries.

The Bottom Line

For occasional use on clean digital PDFs: Excel's built-in importer is free and good enough. For real-world accounting workloads — which typically include a mix of scanned documents, varied bank formats, and the need to process statements in bulk — an AI tool removes the friction significantly.

The 10-documents free tier at pdfexcel.ai is worth a test run before committing to any paid service. Most bookkeepers I have spoken to say the first batch of statements they successfully converted in under two minutes was enough to justify the subscription.

I used PDFExcel to convert the sample statements referenced in this guide. All code examples above are tested against tabula-py 2.9, camelot-py 0.11, and pdfplumber 0.11 as of May 2026.