Every accounts payable team has the same recurring problem: a pile of vendor invoices in PDF format, and a spreadsheet that needs updating before the next payment run.
Some of those PDFs are clean — generated directly from accounting software, with selectable text and tidy tables. Many are not: scanned paper invoices, photographed receipts, or vendor PDFs with non-standard layouts that break every generic converter you've tried. This guide covers the full spectrum, from the simple methods to the ones that actually work when your vendor faxes you a JPG disguised as a PDF.
What You're Actually Trying to Extract
Before choosing a method, it helps to be precise about what data you need from an invoice:
- Header fields: Vendor name, invoice number, invoice date, due date, PO number
- Line items: Description, quantity, unit price, line total
- Totals: Subtotal, tax, discounts, amount due
- Remittance details: Vendor bank account or payment address
Not all methods extract all of these. A tool that pulls the line-item table perfectly might drop the invoice date if it's in the header above the table. Know which fields you need before committing to a workflow.
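One way to pin this down before picking a tool is to write the target fields out as a small schema. The sketch below is only an illustration: the class and field names are made up for this article, and your AP system will likely want different ones.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal


@dataclass
class Invoice:
    # Header fields
    vendor_name: str
    invoice_number: str
    invoice_date: date
    due_date: Optional[date] = None
    po_number: Optional[str] = None
    # Line items
    line_items: list[LineItem] = field(default_factory=list)
    # Totals
    subtotal: Optional[Decimal] = None
    tax: Optional[Decimal] = None
    discounts: Optional[Decimal] = None
    amount_due: Optional[Decimal] = None
    # Remittance details (bank account or payment address)
    remittance: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Names of the optional fields the extractor left empty."""
        return [name for name, value in vars(self).items() if value is None]
```

Running `missing_fields()` on each extracted invoice gives a quick read on which fields a given method tends to drop.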
Method 1: Excel's Built-In PDF Importer
For a clean, text-layer PDF from a well-formatted vendor, Excel's native import is the fastest path:
- Open Excel → Data → Get Data → From File → From PDF
- Select the invoice PDF
- Excel detects tables and page elements using Power Query
- Preview the detected tables and load the one that contains your line items
What it does well: Fast, free, no external dependencies. Works reliably on PDFs generated by QuickBooks, Xero, FreshBooks, SAP — any system that outputs clean, structured PDF tables.
Where it fails:
- Scanned or photographed invoices (returns nothing — no text layer to read)
- Invoices where the line-item grid spans headers in merged cells (Power Query often fractures these)
- Multi-page invoices where the table continues across pages (each page is treated independently)
- Vendors with creative PDF layouts: some use positioned text boxes rather than real table structures, and Power Query misses them entirely
For a one-off clean digital invoice, start here. For anything else, keep reading.
Method 2: Copy-Paste With Text Editing
Sometimes the simplest tool is fastest. If the PDF has a text layer, you can select all, paste into Excel or a text editor, and clean it up. This works surprisingly well for invoices with simple layouts — vendor name, one or two line items, a total.
The breakdown: non-standard column spacing means pasted text lands in a single column, and separating it into the right cells requires manual work. At 5 invoices a week, this is acceptable. At 50, it is not.
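If the pasted text keeps its column gaps as runs of spaces or tabs, a few lines of Python can do most of the splitting. This is a rough sketch that assumes columns are separated by at least two spaces (or a tab) and that descriptions only contain single spaces; the sample text is invented.

```python
import re


def split_pasted_line(line: str) -> list[str]:
    """Split one pasted line into cells on a tab or a run of 2+ whitespace characters."""
    return [cell.strip() for cell in re.split(r"\s{2,}|\t", line.strip())]


# Invented sample of what a pasted line-item block might look like
pasted = """Widget A      2     10.00     20.00
Installation labour   1   150.00   150.00"""

rows = [split_pasted_line(line) for line in pasted.splitlines()]
```

Each entry in `rows` is then a list of cells ready to paste or write into a sheet. Layouts where a description itself contains double spaces will still need manual cleanup.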
Method 3: Python with pdfplumber or Camelot
For developers or technically-comfortable analysts who process large volumes of the same invoice format, Python delivers the most control:
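```python
import pdfplumber
import pandas as pd

with pdfplumber.open("vendor-invoice.pdf") as pdf:
    page = pdf.pages[0]  # first page only
    tables = page.extract_tables()

    if tables:
        # Treat the first row of the first detected table as the header
        df = pd.DataFrame(tables[0][1:], columns=tables[0][0])
        # Writing .xlsx requires openpyxl to be installed
        df.to_excel("invoice_lines.xlsx", index=False)
```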
For lattice-style tables (visible border lines), camelot handles the extraction more reliably:
```python
import camelot

# flavor="lattice" uses the ruled cell borders to locate the table
tables = camelot.read_pdf("vendor-invoice.pdf", flavor="lattice")
tables[0].df.to_excel("invoice_lines.xlsx", index=False)
```
When Python is the right call: You receive 200+ invoices monthly from the same three vendors. You write the extraction logic once — tuning to their specific layouts — and then it runs automatically. The upfront cost is real (1-4 hours per vendor template), but at scale it pays off.
When it breaks down:
- Scanned invoices (need OCR; adding pytesseract or easyocr raises setup complexity significantly)
- New or irregular vendor formats (each new format requires a new parsing script)
- Mixed batches with 20 different vendors (template proliferation becomes its own management problem)
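The pdfplumber snippet above only reads the first page. For a known vendor whose line-item table runs over several pages, one possible extension is to collect a table from every page and stitch them together, dropping repeated header rows. The helper names here are hypothetical, and the sketch assumes the line-item table is the first table detected on each page.

```python
def stitch_tables(page_tables: list[list[list[str]]]) -> list[list[str]]:
    """Merge one table per page into a single table, keeping the header row once."""
    page_tables = [table for table in page_tables if table]
    if not page_tables:
        return []
    header = page_tables[0][0]
    merged = [header]
    for table in page_tables:
        for row in table:
            if row == header:
                continue  # skip the header on page 1 and any repeats on later pages
            merged.append(row)
    return merged


def extract_all_pages(path: str) -> list[list[str]]:
    """Hypothetical wrapper: pull the first table from each page, then stitch."""
    import pdfplumber  # third-party; imported here so stitch_tables runs without it

    first_tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            if tables:
                first_tables.append(tables[0])
    return stitch_tables(first_tables)
```

This only works when the continuation pages use the same column order as page one, which is usually true for a single vendor template and rarely true across vendors.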
Method 4: AI PDF-to-Excel Converters
For AP teams that deal with a mix of vendors, scanned documents, and irregular formats — which describes most real-world invoice processing — general AI converters offer the best balance of accuracy and flexibility.
The critical distinction in this category is OCR quality. A traditional converter reads the PDF's text layer. An AI-powered converter with genuine OCR reads the image, reconstructs the layout, and maps text to rows and columns — which is the only approach that works on scanned invoices.
Tools like PDFExcel are built specifically for this: they handle photographed documents, scanned PDFs, and multi-vendor formats without requiring you to configure a template for each vendor. You upload the invoice, and the output is a structured spreadsheet — vendor name in its own cell, line items in rows, totals separated from the item grid.
When evaluating any AI converter for invoice work, test it with these three cases:
- A clean digital invoice from a major accounting platform (easy — nearly every tool passes this)
- A photographed invoice from a small vendor (medium — tests OCR accuracy)
- A multi-page invoice with a line-item table that spans pages 1-3 (hard — tests whether the tool reassembles the table correctly)
The third test is the one that exposes tools that demo well but fail in production.
Method 5: Dedicated Invoice Processing Platforms
For large AP operations with structured approval workflows, dedicated platforms may justify the cost:
- Nanonets — AI-based invoice extraction with GL-coding and approval routing; integrates with NetSuite, SAP, QuickBooks
- Klippa — strong on receipt and invoice OCR; API-first design suits developers building AP pipelines
- Docsumo — neural-network extraction tuned to specific invoice types including tax forms
These tools are built for the enterprise AP workflow — they capture the data, route it for approval, and push it to your ERP. If you need that entire pipeline, they're worth evaluating. If you just need the data in a spreadsheet, the per-document cost and setup overhead often exceed the value.
Handling the Hard Cases
Scanned invoices from international vendors
Scanned invoices introduce two problems: OCR accuracy on non-English characters, and document skew (the paper was placed on the scanner at an angle). Good AI converters handle both. If you're receiving a large volume of scanned invoices from specific countries, test a representative sample — French punctuation, German umlauts, and Japanese invoice formats all produce different OCR failure modes.
Invoices with totals in the body copy, not a table
Some vendors — especially smaller ones and sole traders — send PDFs that are essentially formatted emails: paragraphs of text with the total buried in a sentence like "Total due: $1,450.00." Table-extraction tools will miss this. AI converters with natural language understanding can pull it; simpler tools cannot.
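If you already have the text and the phrasing is predictable, a regular expression can serve as a cheap fallback. To be clear, this is plain pattern matching, not natural language understanding. The pattern below is an assumption about how such sentences tend to be worded, and it expects US-style numbers (comma thousands separator, dot decimal).

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed phrasings: "Total due", "Amount due", "Balance due", optional colon and currency symbol
TOTAL_PATTERN = re.compile(
    r"(?:total|amount|balance)\s+due\s*:?\s*[$€£]?\s*(\d[\d,]*(?:\.\d{2})?)",
    re.IGNORECASE,
)


def find_total(text: str) -> Optional[Decimal]:
    """Return the first 'total due' style amount found in free text, or None."""
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


body = "Thanks for your business. Total due: $1,450.00. Please pay within 30 days."
total = find_total(body)
```

Anything the pattern misses returns `None`, which is a useful signal to route that invoice to manual review rather than guess.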
Multi-currency invoices
If you receive invoices in USD, EUR, and GBP in the same batch, the conversion step is outside what any PDF extractor does — that's a post-extraction calculation. Flag currency in a dedicated column (most good extractors include it) so you can apply exchange rates downstream.
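As a sketch of that downstream step, the snippet below adds a converted column next to the original amount. The exchange rates are placeholder values invented for the example, and the row layout is hypothetical; real rates would come from your finance team or a rates feed.

```python
from decimal import Decimal, ROUND_HALF_UP

# Placeholder rates to USD, invented for illustration only
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.08"), "GBP": Decimal("1.27")}


def to_usd(amount: Decimal, currency: str, rates=RATES_TO_USD) -> Decimal:
    """Convert an amount to USD, rounded to cents; fail loudly on unknown currencies."""
    if currency not in rates:
        raise ValueError(f"No exchange rate for {currency!r}")
    return (amount * rates[currency]).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


rows = [
    {"invoice": "A-1", "currency": "USD", "amount_due": Decimal("100.00")},
    {"invoice": "B-7", "currency": "EUR", "amount_due": Decimal("250.00")},
]
for row in rows:
    # Keep the original amount and currency; add the converted value alongside
    row["amount_due_usd"] = to_usd(row["amount_due"], row["currency"])
```

Keeping the original amount and currency columns untouched means the conversion can be re-run later with the rate that applied on the payment date.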
Building a Repeatable AP Invoice Workflow
Once you have a reliable extraction step, the full workflow looks like this:
- Collect: vendor portal, email, or physical scan → single-format PDF
- Extract: AI converter → raw spreadsheet (vendor, invoice #, date, due date, line items, total)
- Validate: three-way match — PO amount, received goods quantity, invoice amount. Flag mismatches.
- Code: assign GL codes, cost centers, department
- Approve: route to the right approver based on amount and category
- Import: push to your AP system (QuickBooks, Xero, NetSuite) using their CSV import format
- Archive: store original PDF + extracted spreadsheet together, keyed by invoice number
The Extract step is where most manual time is lost. Automating it, even at 90% accuracy with a human review step for exceptions, cuts processing time substantially.
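The Validate step is also easy to script once extraction is reliable. Here is a minimal sketch of a three-way match, assuming each record is a plain dict with the keys shown; the key names and the one-cent tolerance are choices made for this example, not a standard.

```python
from decimal import Decimal


def three_way_match(po: dict, receipt: dict, invoice: dict,
                    tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return mismatch descriptions; an empty list means the invoice passes."""
    issues = []
    # Invoiced quantity should equal what was actually received
    if invoice["quantity"] != receipt["quantity"]:
        issues.append(
            f"quantity: invoiced {invoice['quantity']}, received {receipt['quantity']}"
        )
    # Invoiced amount should equal the PO amount, within a rounding tolerance
    if abs(invoice["amount"] - po["amount"]) > tolerance:
        issues.append(f"amount: invoiced {invoice['amount']}, PO {po['amount']}")
    return issues


po = {"amount": Decimal("500.00")}
receipt = {"quantity": 10}
invoice = {"quantity": 12, "amount": Decimal("600.00")}
flags = three_way_match(po, receipt, invoice)
```

Invoices that come back with an empty list can flow straight to coding; anything else goes to the exception queue.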
Choosing the Right Method
| Situation | Best approach |
|---|---|
| One-off clean digital invoice | Excel Power Query |
| High-volume batches from 2-3 known vendors, same format | Python (pdfplumber or Camelot) |
| Mixed vendors, any scanned or photographed invoices | AI PDF converter |
| Enterprise AP with approval routing and ERP integration | Nanonets, Klippa, or similar |
Most mid-size accounting teams land in the third row: too many vendor formats for Python templates, too many scanned documents for Excel's built-in importer. The AI converter handles the extraction; your AP team handles the validation and coding.
Common Mistakes
Assuming all vendor PDFs are text-layer PDFs. A file ending in .pdf can be a pure image with no extractable text at all. If your converter returns empty cells, open the PDF in Adobe Reader and try to select text. If you can't, the document is image-only and needs OCR.
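The same check can be scripted for a whole batch. The sketch below treats a PDF as image-only when no page yields a meaningful amount of text; the 20-character threshold is an arbitrary assumption, and the function names are hypothetical.

```python
def has_text_layer(page_texts, min_chars: int = 20) -> bool:
    """True if any page's extracted text has at least min_chars non-whitespace characters."""
    return any(len("".join((text or "").split())) >= min_chars for text in page_texts)


def pdf_has_text_layer(path: str) -> bool:
    """Hypothetical wrapper around pdfplumber's per-page text extraction."""
    import pdfplumber  # third-party; imported here so has_text_layer runs without it

    with pdfplumber.open(path) as pdf:
        return has_text_layer(page.extract_text() for page in pdf.pages)
```

Files that fail the check can be routed to an OCR-capable tool up front instead of coming back from the converter as empty cells.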
Using a single total to validate extraction. Always check that the sum of extracted line items matches the invoice total. Extraction errors often appear in individual line items, not the footer total (which is sometimes hardcoded as static text rather than a calculated cell).
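A reconciliation check along these lines takes a few lines of code. This sketch assumes the amount due equals line items plus tax minus discounts, which will not hold for every invoice (shipping, deposits, and retainage are common exceptions), and the function name is made up.

```python
from decimal import Decimal


def check_totals(line_totals, tax, discounts, amount_due, tolerance=Decimal("0.01")):
    """Compare sum(line items) + tax - discounts against the extracted amount due."""
    expected = sum(line_totals, Decimal("0")) + tax - discounts
    return abs(expected - amount_due) <= tolerance, expected


lines = [Decimal("20.00"), Decimal("150.00")]
matches, expected = check_totals(lines, Decimal("17.00"), Decimal("0"), Decimal("187.00"))
```

When `matches` is false, the gap between `expected` and the extracted total often points at the specific line that was misread.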
Not standardizing the output format. Every vendor uses different column names and date formats. Before importing to your AP system, run a normalization step: consistent date format (YYYY-MM-DD), consistent currency format (no commas, two decimal places), consistent column headers. A lookup table mapping vendor-specific column names to your standard schema saves hours at import time.
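A normalization step could look something like the following. The header map and the list of accepted date formats are invented for illustration; ambiguous formats such as 01/02/2024 need a per-vendor rule, since no parser can tell day-first from month-first on its own.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical lookup table: vendor-specific headers mapped to one standard schema
HEADER_MAP = {"inv no": "invoice_number", "invoice #": "invoice_number",
              "amt": "amount", "total": "amount", "date": "invoice_date"}
# Assumed input formats, tried in order (day-first for the slash format)
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y", "%B %d, %Y")


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def normalize_amount(raw: str) -> str:
    """Strip currency symbols and commas; return a string with two decimal places."""
    cleaned = raw.strip().lstrip("$€£").replace(",", "")
    return f"{Decimal(cleaned):.2f}"


def normalize_row(row: dict) -> dict:
    """Rename headers to the standard schema and normalize dates and amounts."""
    out = {}
    for key, value in row.items():
        name = HEADER_MAP.get(key.strip().lower(), key.strip().lower())
        if name == "invoice_date":
            value = normalize_date(value)
        elif name == "amount":
            value = normalize_amount(value)
        out[name] = value
    return out


clean = normalize_row({"Inv No": "INV-001", "Date": "15/01/2024", "Amt": "$1,450.5"})
```

Raising on an unrecognized date, rather than passing it through, keeps bad values out of the AP import file.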
The Bottom Line
For a single clean PDF, Excel's built-in importer is fast and free. For large volumes of the same format, Python pays off after the upfront template cost. For everything else — mixed vendors, scanned documents, one-offs from clients — an AI converter is the practical choice, and the cost (typically the price of an hour of staff time per month) is covered by the time saved on the first batch.
I used PDFExcel to test against a photographed invoice from a contractor and a multi-page vendor statement; both came back as clean spreadsheets without requiring template setup. Your results will depend on document quality, so test with a representative sample from your actual vendor mix before committing.
Have a specific invoice format that's breaking your extraction workflow? Drop it in the comments — the edge cases are often more instructive than the clean examples.