# Document AI Toolkit
Turn unstructured documents into structured, queryable data. This toolkit provides complete pipelines for parsing PDFs, extracting tables, running OCR on scanned documents, summarizing long-form content, and pulling structured fields from invoices, contracts, and reports. Built as composable pipeline stages so you can mix, match, and extend for your specific document types.
## Key Features
- PDF Parsing Engine — Extract text, metadata, and layout information from PDFs with support for multi-column layouts and embedded images
- OCR Integration — Process scanned documents and images with configurable OCR backends (Tesseract, cloud APIs) and pre-processing for skew correction
- Table Extraction — Detect and extract tables from PDFs and images into pandas DataFrames or CSV, handling merged cells and spanning headers
- Summarization Chains — Multi-stage summarization for long documents: chunk → summarize → merge, with configurable compression ratios
- Structured Data Extraction — Define extraction schemas and pull typed fields (dates, amounts, names, addresses) from any document
- Document Classification — Automatically categorize incoming documents by type (invoice, contract, report, letter) before routing to specialized extractors
- Batch Processing — Process thousands of documents in parallel with progress tracking, retry logic, and partial-failure handling
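The composable-stage design described above can be sketched in plain Python. This is an illustrative pattern only, not the toolkit's actual internals: each stage is a callable that receives and returns a shared result dict, and the pipeline folds the document through the stages in order.

```python
from typing import Any, Callable, Dict, List

# A stage takes the accumulated result and returns it enriched.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

class SimplePipeline:
    """Minimal sketch of a composable stage pipeline (hypothetical)."""

    def __init__(self, stages: List[Stage]):
        self.stages = stages

    def process(self, text: str) -> Dict[str, Any]:
        result: Dict[str, Any] = {"text": text}
        for stage in self.stages:
            result = stage(result)  # each stage enriches the shared result
        return result

# Two toy stages: normalize whitespace, then count words.
def normalize(r: Dict[str, Any]) -> Dict[str, Any]:
    r["text"] = " ".join(r["text"].split())
    return r

def word_count(r: Dict[str, Any]) -> Dict[str, Any]:
    r["word_count"] = len(r["text"].split())
    return r

pipeline = SimplePipeline([normalize, word_count])
out = pipeline.process("Invoice   from  Acme Corp")
# out["word_count"] == 4
```

Because every stage shares one interface, stages can be reordered, removed, or swapped per document type without touching the pipeline itself.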
## Quick Start
```python
from document_ai import DocumentPipeline, stages

# 1. Build a pipeline
pipeline = DocumentPipeline([
    stages.PDFParser(extract_images=True),
    stages.OCR(engine="tesseract", language="eng"),
    stages.TableExtractor(output_format="dataframe"),
    stages.Summarizer(model="gpt-4o-mini", max_summary_length=200),
    stages.StructuredExtractor(schema="schemas/invoice.yaml"),
])

# 2. Process a document
result = pipeline.process("documents/invoice_2025_Q3.pdf")

print(result.text[:500])          # Full extracted text
print(result.tables[0].to_csv())  # First table as CSV
print(result.summary)             # LLM-generated summary
print(result.structured_data)     # {"vendor": "Acme Corp", "total": 4250.00, ...}
```
## Architecture
```
Input Document (PDF / Image / Scan)
                │
                ▼
       ┌─────────────────┐
       │   PDF Parser    │──── Extract text + layout + embedded images
       └────────┬────────┘
                │
                ├── Has text ──────────────▶ Text output
                │
                └── Image/Scan ──▶ ┌───────────┐
                                   │    OCR    │──── Text from images
                                   └─────┬─────┘
                                         │
                ┌────────────────────────┘
                ▼
       ┌─────────────────┐
       │ Table Extractor │──── Detect tables → DataFrames
       └────────┬────────┘
                ▼
       ┌─────────────────┐
       │   Summarizer    │──── Chunk → Summarize → Merge
       └────────┬────────┘
                ▼
       ┌─────────────────┐
       │ Schema Extractor│──── Extract typed fields per schema
       └────────┬────────┘
                ▼
DocumentResult (text, tables, summary, structured_data, metadata)
```
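The "Has text" vs. "Image/Scan" branch in the diagram can be approximated with a simple heuristic. This is a sketch under an assumed threshold, not the toolkit's actual routing logic: if a page yields little or no extractable text, it is likely a scan and should be rasterized and sent to OCR.

```python
def route_page(extracted_text: str, min_chars: int = 20) -> str:
    """Decide whether a page's text layer is usable, or whether the
    page should go to OCR. The min_chars threshold is an assumption."""
    if len(extracted_text.strip()) >= min_chars:
        return "text"
    return "ocr"

assert route_page("Invoice #42 from Acme Corp, total due $4,250.00") == "text"
assert route_page("") == "ocr"         # scanned page with no text layer
assert route_page("   \n  ") == "ocr"  # whitespace-only artifacts
```

A production router would likely also check per-page image coverage, since some PDFs carry a partial (or garbage) text layer alongside scanned images.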
## Usage Examples
### Define Custom Extraction Schemas
```python
from document_ai.extraction import ExtractionSchema, Field

invoice_schema = ExtractionSchema(
    name="invoice",
    fields=[
        Field("vendor_name", type="string", description="Company that issued the invoice"),
        Field("invoice_number", type="string", pattern=r"INV-[\d-]+"),
        Field("date", type="date", formats=["%Y-%m-%d", "%m/%d/%Y"]),
        Field("line_items", type="list", item_schema={
            "description": "string",
            "quantity": "integer",
            "unit_price": "float",
        }),
        Field("total_amount", type="float", description="Total amount due"),
    ],
)

result = pipeline.process("invoice.pdf", schema=invoice_schema)
print(result.structured_data)
# {"vendor_name": "Acme Corp", "invoice_number": "INV-2025-0042",
#  "date": "2025-03-15", "total_amount": 12750.00, "line_items": [...]}
```
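Typed-field validation of the kind the schema above describes can be sketched with the standard library. The `validate_field` helper below is hypothetical (not the toolkit's API); it shows the general idea: regex patterns gate strings, date formats are tried in order, and numerics are coerced or rejected.

```python
import re
from datetime import datetime

def validate_field(value, ftype, pattern=None, formats=None):
    """Return the coerced value, or None if validation fails.
    Hypothetical helper illustrating typed extraction checks."""
    if ftype == "string":
        if pattern and not re.fullmatch(pattern, value):
            return None
        return value
    if ftype == "date":
        # Try each accepted format; normalize to ISO 8601 on success.
        for fmt in formats or []:
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                continue
        return None
    if ftype == "float":
        try:
            return float(value)
        except (TypeError, ValueError):
            return None
    return value  # unknown types pass through

assert validate_field("INV-2025-0042", "string", pattern=r"INV-\d{4}-\d{4}") == "INV-2025-0042"
assert validate_field("03/15/2025", "date", formats=["%Y-%m-%d", "%m/%d/%Y"]) == "2025-03-15"
assert validate_field("not-a-number", "float") is None
```

Normalizing dates to a single canonical format at validation time keeps downstream consumers (databases, analytics) from having to handle both `%Y-%m-%d` and `%m/%d/%Y`.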
### Batch Processing with Progress Tracking
```python
from document_ai import BatchProcessor
from pathlib import Path

processor = BatchProcessor(
    pipeline=pipeline,
    max_workers=4,
    retry_on_failure=True,
    max_retries=2,
)

results = processor.process_directory(
    Path("documents/inbox/"),
    glob_pattern="*.pdf",
    output_dir=Path("documents/processed/"),
)

print(f"Processed: {results.success_count}/{results.total_count}")
print(f"Failed: {[f.filename for f in results.failures]}")
```
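The parallelism-with-retries behavior can be sketched with `concurrent.futures`. This is illustrative only (the real `BatchProcessor` API is shown above); function and tuple shapes here are my own assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def process_with_retry(fn, item, max_retries=2):
    """Call fn(item), retrying up to max_retries times on exception.
    Returns (item, result, error); error is None on success."""
    last_err = None
    for _ in range(max_retries + 1):
        try:
            return (item, fn(item), None)
        except Exception as err:
            last_err = err
    return (item, None, last_err)

def process_batch(fn, items, max_workers=4, max_retries=2):
    """Run fn over items in parallel, separating successes from failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_with_retry, fn, it, max_retries)
                   for it in items]
        results = [f.result() for f in futures]
    successes = [r for r in results if r[2] is None]
    failures = [r for r in results if r[2] is not None]
    return successes, failures

# Toy "pipeline" stand-in: fails on empty filenames.
def fake_process(name):
    if not name:
        raise ValueError("empty document")
    return name.upper()

ok, bad = process_batch(fake_process, ["a.pdf", "", "b.pdf"])
# len(ok) == 2, len(bad) == 1
```

Returning failures alongside successes (rather than raising) is what makes partial-failure handling possible: one corrupt PDF should not abort a thousand-document batch.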
### Multi-Stage Summarization for Long Documents
```python
from document_ai.summarization import MapReduceSummarizer

summarizer = MapReduceSummarizer(
    chunk_size=2000,           # Tokens per chunk
    chunk_overlap=200,         # Overlap between chunks
    map_model="gpt-4o-mini",   # Cheap model for individual chunks
    reduce_model="gpt-4o",     # Better model for final merge
    final_length=500,          # Target summary length in tokens
)

summary = summarizer.summarize(long_document_text)
```
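The chunk → summarize → merge strategy depends on overlapping chunking so that information at chunk boundaries appears in at least one full chunk. A minimal sketch, splitting on whitespace tokens as an approximation of true model tokens:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into whitespace-token chunks of chunk_size tokens,
    each sharing `overlap` tokens with its predecessor."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks

chunks = chunk_text(" ".join(str(i) for i in range(10)), chunk_size=4, overlap=1)
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each chunk would then go to the cheap `map` model, and the per-chunk summaries would be concatenated and passed once to the stronger `reduce` model.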
## Configuration
```yaml
# document_ai_config.yaml
pdf_parser:
  extract_images: true
  image_dpi: 300               # DPI for image extraction
  layout_analysis: true        # Detect columns, headers, footers
  password: null               # For encrypted PDFs

ocr:
  engine: "tesseract"          # tesseract | google_vision | aws_textract
  language: "eng"
  preprocessing:
    deskew: true               # Correct page rotation
    denoise: true              # Remove noise from scanned docs
    binarize: true             # Convert to black/white
  confidence_threshold: 0.6    # Below this, flag for manual review

table_extraction:
  detection_method: "lattice"  # lattice | stream | hybrid
  output_format: "dataframe"   # dataframe | csv | json
  merge_adjacent: true         # Merge tables split across pages

summarization:
  model: "gpt-4o-mini"
  strategy: "map_reduce"       # map_reduce | refine | stuff
  chunk_size: 2000
  chunk_overlap: 200
  max_summary_length: 300

extraction:
  model: "gpt-4o"
  confidence_threshold: 0.8    # Flag low-confidence extractions
  validate_types: true         # Enforce field type constraints
  schema_dir: "schemas/"

batch:
  max_workers: 4
  max_retries: 2
  output_format: "json"        # json | csv | parquet
  save_intermediate: false     # Save per-stage outputs for debugging
```
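Loading a layered config like the one above typically means merging user overrides onto defaults. The toolkit's actual loader is not shown here; this is a stdlib-only deep-merge sketch of the pattern:

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides onto defaults without mutating either."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into sections
        else:
            merged[key] = value  # scalar or new key: override wins
    return merged

defaults = {
    "ocr": {"engine": "tesseract", "language": "eng"},
    "batch": {"max_workers": 4},
}
user = {"ocr": {"language": "deu"}}

cfg = deep_merge(defaults, user)
# cfg["ocr"] == {"engine": "tesseract", "language": "deu"}
```

A recursive merge preserves untouched sibling keys (here, `ocr.engine` and the whole `batch` section), which a naive `dict.update` would clobber.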
## Best Practices
- Pre-process scanned documents — Deskew, denoise, and binarize before OCR to dramatically improve text quality.
- Use the cheapest model for summarization chunks — Reserve the larger model (e.g. `gpt-4o`) for the final merge step; `gpt-4o-mini` handles individual chunks well.
- Define schemas per document type — Generic extraction is weak. Dedicated schemas for invoices, contracts, and reports yield much higher accuracy.
- Validate extraction results — Always check `confidence_threshold` on extracted fields and route low-confidence items for human review.
- Process in batches, not one-by-one — The `BatchProcessor` handles parallelism, retries, and partial failures automatically.
- Keep OCR language packs minimal — Only install language packs you actually need to keep the deployment lightweight.
## Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| OCR produces garbled text | Poor scan quality or wrong language | Enable preprocessing (deskew, denoise) and verify the language setting |
| Table extraction misses tables | Tables use borderless/minimal styling | Switch `detection_method` to `stream` or `hybrid` |
| Summarization loses critical details | Chunk size too small, so important info is split | Increase `chunk_overlap` to 300+ and use the `refine` strategy |
| Structured extraction returns null fields | Schema field descriptions too vague | Add specific descriptions and example values to each `Field` |
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete Document AI Toolkit with all files, templates, and documentation for $49.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.