Every back-office workflow starts with a stack of PDFs. Invoice processing, loan underwriting, insurance claims, legal review — they all begin with unstructured documents and end with data that needs to go into a database.
Traditional OCR-plus-template engines are brittle and can require months of configuration per document type. mawlaia-docparse uses LLMs to extract fields against a schema, so the same extractor works across document types.
The core idea
from docparse import Extractor, InvoiceSchema
extractor = Extractor(schema=InvoiceSchema())
result = extractor.extract("invoice.pdf")
print(result.vendor_name) # "Acme Corp"
print(result.total_amount) # 1250.00
print(result.line_items) # [{"description": "...", "qty": 2, "unit_price": 625.0}]
print(result.confidence) # 0.94
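Once a result comes back, a common next step is deciding whether it can go straight into the database or should be checked by a person. The following is a minimal sketch of that gate, not part of the library: `ExtractedInvoice` is a hypothetical stand-in that mirrors the fields printed above, `needs_review` is an illustrative helper, and the 0.9 threshold and 0.01 tolerance are assumptions. The sample values are taken from the example output, with "Widget" as a placeholder description.

```python
from dataclasses import dataclass, field
from math import isclose


@dataclass
class ExtractedInvoice:
    """Hypothetical stand-in mirroring the fields shown in the example above."""
    vendor_name: str
    total_amount: float
    line_items: list = field(default_factory=list)
    confidence: float = 0.0


def needs_review(result, min_confidence=0.9, tolerance=0.01):
    """Return the reasons (if any) a result should go to a human reviewer."""
    reasons = []
    # Assumed threshold: anything below min_confidence is flagged.
    if result.confidence < min_confidence:
        reasons.append("low confidence")
    # Recompute the total from the line items and compare with the extracted total.
    line_total = sum(item["qty"] * item["unit_price"] for item in result.line_items)
    if not isclose(line_total, result.total_amount, abs_tol=tolerance):
        reasons.append("line items do not sum to total")
    return reasons


sample = ExtractedInvoice(
    vendor_name="Acme Corp",
    total_amount=1250.00,
    line_items=[{"description": "Widget", "qty": 2, "unit_price": 625.0}],
    confidence=0.94,
)
print(needs_review(sample))  # []
```

Real invoices often carry tax, shipping, or discounts, so a production check would likely compare against a subtotal rather than the grand total.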
Five vertical schemas
InvoiceSchema — vendor, buyer, line items, totals, payment terms, tax, dates.
ContractSchema — parties, effective date, termination clauses, obligations, jurisdiction.
MedicalRecordSchema — patient demographics, diagnoses (ICD codes), medications, procedures, dates.
FinancialStatementSchema — revenue, expenses, EBITDA, balance sheet items, period.
IDDocumentSchema — name, DOB, document number, expiry, issuing authority.
Custom schemas
from docparse import Extractor, BaseSchema
from pydantic import BaseModel
from typing import List
class PurchaseOrderSchema(BaseSchema):
    po_number: str
    supplier: str
    line_items: List[dict]
    delivery_date: str
result = Extractor(schema=PurchaseOrderSchema()).extract("po_12345.pdf")
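Because the custom schema above types `line_items` as `List[dict]` and `delivery_date` as `str`, downstream code may want stricter types before writing to a database. Here is one possible post-processing sketch using only the standard library. It assumes the extracted fields are available as a plain dict, that line items carry the same `description`, `qty`, and `unit_price` keys as the invoice example, and that the date arrives in ISO 8601 form (`YYYY-MM-DD`). `LineItem`, `normalize_po`, and the sample values are all hypothetical and not part of docparse.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LineItem:
    """Hypothetical typed line item; keys assumed to match the invoice example."""
    description: str
    qty: int
    unit_price: float


def normalize_po(raw: dict) -> dict:
    """Convert loosely typed extraction output into typed Python values."""
    return {
        "po_number": raw["po_number"].strip(),
        "supplier": raw["supplier"].strip(),
        # Unknown or missing keys raise TypeError here, which surfaces bad extractions early.
        "line_items": [LineItem(**item) for item in raw["line_items"]],
        # Assumes ISO 8601 (YYYY-MM-DD); date.fromisoformat raises ValueError otherwise.
        "delivery_date": date.fromisoformat(raw["delivery_date"]),
    }


# Illustrative values only, not real extraction output.
raw = {
    "po_number": " PO-12345 ",
    "supplier": "Acme Corp",
    "line_items": [{"description": "Widget", "qty": 2, "unit_price": 625.0}],
    "delivery_date": "2026-03-15",
}
po = normalize_po(raw)
print(po["po_number"], po["delivery_date"].year, po["line_items"][0].qty)
```

Failing loudly on a malformed date or an unexpected line-item key is a deliberate choice in this sketch: it is usually cheaper to reject a bad extraction than to clean it out of a database later.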
TypeScript
import { Extractor, InvoiceSchema } from 'mawlaia-docparse';
const result = await new Extractor({ schema: new InvoiceSchema() }).extract('invoice.pdf');
console.log(result.vendorName, result.totalAmount);
Installation
pip install mawlaia-docparse
npm install mawlaia-docparse
Source and tests (45 Python, 61 TypeScript) are available under the MIT license: github.com/Mawlaia-Labs/docparse
A hosted version with batch processing, webhook delivery, and fine-tuned vertical models is coming in Q3 2026. Early access: dev@mawlaia.com.