DEV Community

mawlaia

Extract structured data from any PDF with one line of Python (open-source)

Every back-office workflow starts with a stack of PDFs. Invoice processing, loan underwriting, insurance claims, legal review — they all begin with unstructured documents and end with data that needs to go into a database.

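Traditional OCR-plus-template pipelines are brittle and can take months of configuration per document type. mawlaia-docparse replaces the per-type templates with an LLM that extracts fields against a schema you pass in, so one code path covers every document type.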


The core idea

from docparse import Extractor, InvoiceSchema

extractor = Extractor(schema=InvoiceSchema())
result = extractor.extract("invoice.pdf")

print(result.vendor_name)    # "Acme Corp"
print(result.total_amount)   # 1250.00
print(result.line_items)     # [{"description": "...", "qty": 2, "unit_price": 625.0}]
print(result.confidence)     # 0.94
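The `confidence` value is what makes the result usable in a back-office pipeline: high-confidence extractions can go straight to the database, and the rest can go to a person. The following is a minimal sketch of that routing, not part of the library. `FakeResult` is a stand-in so the snippet runs without docparse installed, and the `0.90` threshold and `route` helper are assumptions for illustration.

```python
from dataclasses import dataclass


# Stand-in for the object returned by extractor.extract(); only the
# attribute names are taken from the example above.
@dataclass
class FakeResult:
    vendor_name: str
    total_amount: float
    confidence: float


REVIEW_THRESHOLD = 0.90  # hypothetical cut-off, tune per workflow


def route(result, threshold=REVIEW_THRESHOLD):
    """Return 'auto' when confidence meets the threshold, else 'review'."""
    return "auto" if result.confidence >= threshold else "review"


print(route(FakeResult("Acme Corp", 1250.00, 0.94)))  # auto
print(route(FakeResult("Acme Corp", 1250.00, 0.61)))  # review
```

With the real library, the same `route` function should work on any object that exposes a numeric `confidence` attribute.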

Five vertical schemas

InvoiceSchema — vendor, buyer, line items, totals, payment terms, tax, dates.

ContractSchema — parties, effective date, termination clauses, obligations, jurisdiction.

MedicalRecordSchema — patient demographics, diagnoses (ICD codes), medications, procedures, dates.

FinancialStatementSchema — revenue, expenses, EBITDA, balance sheet items, period.

IDDocumentSchema — name, DOB, document number, expiry, issuing authority.
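Because InvoiceSchema captures both line items and totals, the two can be cross-checked after extraction. Here is a rough sketch of such a check, assuming line items are dicts with `qty` and `unit_price` keys as in the example output earlier in the post. The helper names and the one-cent tolerance are hypothetical, and the check ignores tax, so it only applies when the stated total is a pre-tax sum.

```python
import math


def line_items_total(line_items):
    """Sum qty * unit_price over line-item dicts shaped like the example output."""
    return sum(item["qty"] * item["unit_price"] for item in line_items)


def totals_agree(line_items, total_amount, tolerance=0.01):
    """True when the line-item sum is within `tolerance` of the stated total."""
    return math.isclose(line_items_total(line_items), total_amount, abs_tol=tolerance)


items = [{"description": "...", "qty": 2, "unit_price": 625.0}]
print(totals_agree(items, 1250.00))  # True
print(totals_agree(items, 1300.00))  # False
```

A mismatch does not say which field is wrong, only that the document deserves a second look.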


Custom schemas

from typing import List

from docparse import Extractor, BaseSchema

class PurchaseOrderSchema(BaseSchema):
    po_number: str
    supplier: str
    line_items: List[dict]
    delivery_date: str

result = Extractor(schema=PurchaseOrderSchema()).extract("po_12345.pdf")
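Since `delivery_date` is declared as `str` in the schema above, downstream code will probably want to normalize it before writing to a database. A minimal sketch using only the standard library follows. The `parse_delivery_date` helper and the list of formats are assumptions, not something docparse provides, and `%d/%m/%Y` is a guess that would misread US-style dates.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats to try; extend for the documents you see.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def parse_delivery_date(raw: str) -> Optional[date]:
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


print(parse_delivery_date("2026-03-14"))      # 2026-03-14
print(parse_delivery_date("March 14, 2026"))  # 2026-03-14
print(parse_delivery_date("next Tuesday"))    # None
```

Returning `None` instead of raising keeps the decision about unparseable dates with the caller, for example to send the document for manual review.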

TypeScript

import { Extractor, InvoiceSchema } from 'mawlaia-docparse';

const result = await new Extractor({ schema: new InvoiceSchema() }).extract('invoice.pdf');
console.log(result.vendorName, result.totalAmount);

Installation

pip install mawlaia-docparse
npm install mawlaia-docparse

Source and tests (45 Python, 61 TypeScript), MIT-licensed: github.com/Mawlaia-Labs/docparse


A hosted version with batch processing, webhook delivery, and fine-tuned vertical models is coming in Q3 2026. Early access: dev@mawlaia.com.
