DEV Community

Alex Jay

How to Extract Data from Invoices with Python (3 Lines of Code)

If you've ever had to manually type invoice data into a spreadsheet — vendor names, totals, line items, due dates — you know how painfully slow and error-prone it is.

I needed to automate this for a project and couldn't find anything that didn't require training custom ML models or setting up heavy cloud infrastructure. So I built aPapyr, a simple API that reads invoices (as well as receipts, tax forms, and bank statements) and returns clean, structured JSON.

Here's how it works in Python.

## Install


```bash
pip install apapyr
```

## Extract an Invoice

```python
from apapyr import aPapyr

client = aPapyr("sk_live_your_key")     # your API key
result = client.extract("invoice.pdf")  # PDF or image

print(result.get_field("vendor_name"))  # "Acme Corp"
print(result.get_field("total"))        # 1250.00
print(result.get_field("due_date"))     # "2026-04-15"
```

That's it. Three lines after installation: import the package, create a client, call `extract`. Send a PDF or image, get structured data back.
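
The snippet above hard-codes the key for brevity. A minimal sketch of reading it from the environment instead is shown here; the variable name `APAPYR_API_KEY` and the helper `load_api_key` are my own choices for illustration, not something the aPapyr SDK defines.

```python
import os


def load_api_key(env=os.environ, name="APAPYR_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    `APAPYR_API_KEY` is a hypothetical variable name chosen for this sketch.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Usage with the client shown earlier (not executed here):
#   client = aPapyr(load_api_key())
```

Keeping the key out of the source file means the script can be committed or shared without leaking credentials.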

## What You Get Back

Every field comes with a confidence score (0.0 to 1.0) so you know how reliable each value is:

```python
print(result.confidence)                     # 0.97 (overall)
print(result.get_field_confidence("total"))  # 0.98
print(result.get_field_confidence("notes"))  # 0.72 (handwritten, less certain)
```

  You decide your automation threshold. Confidence above 0.95? Auto-process it. Below 0.8? Flag it for human review.
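
One way to turn those thresholds into code is a small routing function. This is a sketch, not part of the SDK: the 0.95 and 0.8 cut-offs come from the text above, while the function name `route` and the handling of the middle band (here, a lighter "spot_check") are assumptions you would adapt to your own workflow.

```python
AUTO_THRESHOLD = 0.95    # above this: auto-process
REVIEW_THRESHOLD = 0.80  # below this: flag for human review


def route(confidence: float) -> str:
    """Map a confidence score (0.0 to 1.0) to a processing decision."""
    if confidence > AUTO_THRESHOLD:
        return "auto"
    if confidence < REVIEW_THRESHOLD:
        return "review"
    # Assumption: the article does not say what to do between 0.8 and 0.95,
    # so this sketch sends those documents to a lighter spot check.
    return "spot_check"


# Scores taken from the example output above
print(route(0.97))  # auto
print(route(0.72))  # review
```

In practice you would pass in `result.confidence`, or the lowest per-field confidence if a single weak field should hold back the whole document.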

## Line Items Too

It doesn't just pull header fields; it extracts every line item as well:

```python
for item in result.line_items:
    # Each entry is a dict of fields; each field holds its value under "value"
    desc = item.get("description", {}).get("value")
    qty = item.get("quantity", {}).get("value")
    amt = item.get("amount", {}).get("value")
    print(f"{desc}: {qty} x ${amt}")

# Widget A: 50 x $25.00
# Widget B: 30 x $12.50
```

  The API even cross-checks line item totals against the stated total and warns you if they don't add up.
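
If you want to run a similar sanity check on your own side, a rough sketch could look like the following. It is not the API's actual implementation. It assumes the line item shape shown above, treats `amount` as a unit price (so a line total is quantity times amount), and ignores tax, shipping, and discounts. The function name and the one-cent tolerance are my own choices.

```python
from decimal import Decimal


def line_items_match_total(line_items, stated_total, tolerance=Decimal("0.01")):
    """Return (matches, computed_sum) for a list of line item dicts.

    Assumption: "amount" is a unit price, so line total = quantity * amount.
    """
    computed = Decimal("0")
    for item in line_items:
        qty = Decimal(str(item["quantity"]["value"]))
        amt = Decimal(str(item["amount"]["value"]))
        computed += qty * amt
    matches = abs(computed - Decimal(str(stated_total))) <= tolerance
    return matches, computed


# Made-up sample data in the same shape as result.line_items
items = [
    {"description": {"value": "Widget A"}, "quantity": {"value": 50}, "amount": {"value": "25.00"}},
    {"description": {"value": "Widget B"}, "quantity": {"value": 30}, "amount": {"value": "12.50"}},
]

ok, computed = line_items_match_total(items, "1625.00")
print(ok, computed)  # True 1625.00
```

`Decimal` is used instead of `float` so that currency sums do not pick up binary rounding errors.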

## Flat Dictionary Output

If you just want a simple key-value dict without confidence scores (for piping into a database or CSV), call `to_flat_dict()`:

```python
print(result.to_flat_dict())
# {"document_type": "invoice", "vendor_name": "Acme Corp", "total": 1250.00, "due_date": "2026-04-15", ...}
```
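
As an illustration of the CSV case, here is a minimal sketch that writes a list of flat dicts with the standard library's `csv.DictWriter`. The helper name `flat_dicts_to_csv` is hypothetical, and the sample row simply reuses the fields shown above; real output may contain more keys.

```python
import csv
import io


def flat_dicts_to_csv(rows):
    """Serialize a list of flat dicts to CSV text, one row per document."""
    # Collect column names in first-seen order so documents with
    # different fields can share one file.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample row standing in for result.to_flat_dict()
rows = [
    {"document_type": "invoice", "vendor_name": "Acme Corp", "total": 1250.00, "due_date": "2026-04-15"},
]
csv_text = flat_dicts_to_csv(rows)
print(csv_text)
```

To write to disk instead, open the file with `newline=""` and pass the file object to `csv.DictWriter` in place of the `StringIO` buffer.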

## Supported Document Types

It's not just invoices. Pass `document_type="auto"` (the default) and it detects the type automatically:

  - Invoices — vendor, total, tax, due date, line items
  - Receipts — merchant, items, subtotal, tax, tip, payment method
  - W-2 Tax Forms — employer, wages, withholdings
  - Bank Statements — balances, transaction history
  - Contracts — parties, dates, key terms

## Works With AI Agents Too

If you use Claude Code, Cursor, or any MCP-compatible AI assistant:

```bash
claude mcp add apapyr -- npx apapyr-mcp-server
```

Then just ask "Extract the data from invoice.pdf" and your AI assistant handles the rest.

## Try It Free

  - https://apapyr.com/free-tool.html — upload a document, no signup needed
  - https://apapyr.com/dashboard.html — 50 pages/month free, no credit card
  - https://apapyr.com/docs.html — full reference with examples in Python, Node.js, and cURL

  The free tier is enough to test it on real documents. If you're processing thousands of invoices, paid plans start at $49/month.

  ---
aPapyr is open source on GitHub: https://github.com/AkilaJ?tab=repositories&q=apapyr. Star it if you find it useful.
