Extract structured data from any PDF with one line of Python (open-source)

#python #ai #document #automation

Every back-office workflow starts with a stack of PDFs. Invoice processing, loan underwriting, insurance claims, legal review — they all begin with unstructured documents and end with data that needs to go into a database.

Traditional OCR + template engines are brittle and require months of configuration per document type. mawlaia-docparse uses LLMs to make this generic.

The core idea

from docparse import Extractor, InvoiceSchema

extractor = Extractor(schema=InvoiceSchema())
result = extractor.extract("invoice.pdf")

print(result.vendor_name)    # "Acme Corp"
print(result.total_amount)   # 1250.00
print(result.line_items)     # [{"description": "...", "qty": 2, "unit_price": 625.0}]
print(result.confidence)     # 0.94

Five vertical schemas

InvoiceSchema — vendor, buyer, line items, totals, payment terms, tax, dates.

ContractSchema — parties, effective date, termination clauses, obligations, jurisdiction.

MedicalRecordSchema — patient demographics, diagnoses (ICD codes), medications, procedures, dates.

FinancialStatementSchema — revenue, expenses, EBITDA, balance sheet items, period.

IDDocumentSchema — name, DOB, document number, expiry, issuing authority.

Custom schemas

from docparse import Extractor, BaseSchema
from pydantic import BaseModel
from typing import List

class PurchaseOrderSchema(BaseSchema):
    po_number: str
    supplier: str
    line_items: List[dict]
    delivery_date: str

result = Extractor(schema=PurchaseOrderSchema()).extract("po_12345.pdf")

TypeScript

import { Extractor, InvoiceSchema } from 'mawlaia-docparse';

const result = await new Extractor({ schema: new InvoiceSchema() }).extract('invoice.pdf');
console.log(result.vendorName, result.totalAmount);

Installation

pip install mawlaia-docparse

npm install mawlaia-docparse

Source, tests (45 Python, 61 TypeScript), MIT: github.com/Mawlaia-Labs/docparse

Hosted version with batch processing, webhook delivery, and fine-tuned vertical models coming Q3 2026. Early access: dev@mawlaia.com.

DEV Community