Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Document AI Toolkit

Turn unstructured documents into structured, queryable data. This toolkit provides complete pipelines for parsing PDFs, extracting tables, running OCR on scanned documents, summarizing long-form content, and pulling structured fields from invoices, contracts, and reports. Built as composable pipeline stages so you can mix, match, and extend for your specific document types.

Key Features

  • PDF Parsing Engine — Extract text, metadata, and layout information from PDFs with support for multi-column layouts and embedded images
  • OCR Integration — Process scanned documents and images with configurable OCR backends (Tesseract, cloud APIs) and pre-processing for skew correction
  • Table Extraction — Detect and extract tables from PDFs and images into pandas DataFrames or CSV, handling merged cells and spanning headers
  • Summarization Chains — Multi-stage summarization for long documents: chunk → summarize → merge, with configurable compression ratios
  • Structured Data Extraction — Define extraction schemas and pull typed fields (dates, amounts, names, addresses) from any document
  • Document Classification — Automatically categorize incoming documents by type (invoice, contract, report, letter) before routing to specialized extractors
  • Batch Processing — Process thousands of documents in parallel with progress tracking, retry logic, and partial-failure handling
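
The "composable pipeline stages" idea above can be sketched in plain Python (the names here are illustrative stand-ins, not the toolkit's actual API): each stage is a callable that enriches a shared result dict, so stages can be mixed, reordered, or swapped freely.

```python
from typing import Callable

# A stage is any callable that takes and returns the accumulating result dict.
Stage = Callable[[dict], dict]

def run_pipeline(stages: list[Stage], document: str) -> dict:
    """Run each stage in order over a shared result dict."""
    result = {"source": document}
    for stage in stages:
        result = stage(result)
    return result

# Two toy stages standing in for PDFParser and Summarizer.
def parse(result: dict) -> dict:
    result["text"] = f"contents of {result['source']}"
    return result

def summarize(result: dict) -> dict:
    result["summary"] = result["text"][:20]
    return result

result = run_pipeline([parse, summarize], "invoice.pdf")
```

Because every stage shares one interface, adding a new extractor for a custom document type means writing a single function, not modifying the pipeline runner.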

Quick Start

from document_ai import DocumentPipeline, stages

# 1. Build a pipeline
pipeline = DocumentPipeline([
    stages.PDFParser(extract_images=True),
    stages.OCR(engine="tesseract", language="eng"),
    stages.TableExtractor(output_format="dataframe"),
    stages.Summarizer(model="gpt-4o-mini", max_summary_length=200),
    stages.StructuredExtractor(schema="schemas/invoice.yaml"),
])

# 2. Process a document
result = pipeline.process("documents/invoice_2025_Q3.pdf")

print(result.text[:500])           # Full extracted text
print(result.tables[0].to_csv())   # First table as CSV
print(result.summary)              # LLM-generated summary
print(result.structured_data)      # {"vendor": "Acme Corp", "total": 4250.00, ...}

Architecture

Input Document (PDF / Image / Scan)
         │
         ▼
┌─────────────────┐
│   PDF Parser    │──── Extract text + layout + embedded images
└────────┬────────┘
         │
         ├── Has text ──────────────▶ Text output
         │
         └── Image/Scan ──▶ ┌───────────┐
                            │    OCR    │──── Text from images
                            └─────┬─────┘
                                  │
         ┌────────────────────────┘
         ▼
┌─────────────────┐
│ Table Extractor │──── Detect tables → DataFrames
└────────┬────────┘
         ▼
┌─────────────────┐
│  Summarizer     │──── Chunk → Summarize → Merge
└────────┬────────┘
         ▼
┌─────────────────┐
│ Schema Extractor│──── Extract typed fields per schema
└────────┬────────┘
         ▼
    DocumentResult (text, tables, summary, structured_data, metadata)
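
The text-vs-scan branch in the diagram can be sketched as a simple per-page dispatch (with a stubbed OCR backend; the real pipeline would call Tesseract or a cloud API here):

```python
def ocr(image_bytes: bytes) -> str:
    # Stand-in for a real OCR backend such as Tesseract.
    return "<ocr text>"

def extract_text(pages: list[dict]) -> list[str]:
    """Pages with an embedded text layer skip OCR; image-only pages are OCR'd."""
    out = []
    for page in pages:
        if page.get("text"):            # born-digital page: text layer present
            out.append(page["text"])
        else:                           # scanned page: fall back to OCR
            out.append(ocr(page["image"]))
    return out

pages = [{"text": "Hello"}, {"image": b"\x89PNG..."}]
texts = extract_text(pages)
```

Routing this way keeps OCR off the hot path for born-digital PDFs, where the embedded text layer is both faster and more accurate than re-recognizing rendered pages.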

Usage Examples

Define Custom Extraction Schemas

from document_ai.extraction import ExtractionSchema, Field

invoice_schema = ExtractionSchema(
    name="invoice",
    fields=[
        Field("vendor_name", type="string", description="Company that issued the invoice"),
        Field("invoice_number", type="string", pattern=r"INV-\d{4}-\d{4}"),
        Field("date", type="date", formats=["%Y-%m-%d", "%m/%d/%Y"]),
        Field("line_items", type="list", item_schema={
            "description": "string",
            "quantity": "integer",
            "unit_price": "float",
        }),
        Field("total_amount", type="float", description="Total amount due"),
    ],
)

result = pipeline.process("invoice.pdf", schema=invoice_schema)
print(result.structured_data)
# {"vendor_name": "Acme Corp", "invoice_number": "INV-2025-0042",
#  "date": "2025-03-15", "total_amount": 12750.00, "line_items": [...]}

Batch Processing with Progress Tracking

from document_ai import BatchProcessor
from pathlib import Path

processor = BatchProcessor(
    pipeline=pipeline,
    max_workers=4,
    retry_on_failure=True,
    max_retries=2,
)

results = processor.process_directory(
    Path("documents/inbox/"),
    glob_pattern="*.pdf",
    output_dir=Path("documents/processed/"),
)

print(f"Processed: {results.success_count}/{results.total_count}")
print(f"Failed: {[f.filename for f in results.failures]}")
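
The retry and partial-failure behavior can be sketched with `concurrent.futures` (illustrative; the toolkit's `BatchProcessor` internals may differ):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_with_retries(process, path, max_retries=2):
    """Call process(path), retrying up to max_retries times on any exception."""
    for attempt in range(max_retries + 1):
        try:
            return path, process(path), None
        except Exception as exc:
            if attempt == max_retries:
                return path, None, exc

def run_batch(process, paths, max_workers=4, max_retries=2):
    successes, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_with_retries, process, p, max_retries)
                   for p in paths]
        for fut in as_completed(futures):
            path, result, error = fut.result()
            if error:
                failures.append((path, error))   # exhausted retries
            else:
                successes.append((path, result))
    return successes, failures

# Toy processor: fails permanently on one file.
def toy(path):
    if path == "bad.pdf":
        raise ValueError("unreadable")
    return f"parsed {path}"

ok, bad = run_batch(toy, ["a.pdf", "bad.pdf", "c.pdf"], max_workers=2)
```

The key design point is that one unreadable document never aborts the run: failures are collected alongside successes, which is what makes `results.failures` reporting possible.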

Multi-Stage Summarization for Long Documents

from document_ai.summarization import MapReduceSummarizer

summarizer = MapReduceSummarizer(
    chunk_size=2000,         # Tokens per chunk
    chunk_overlap=200,       # Overlap between chunks
    map_model="gpt-4o-mini", # Cheap model for individual chunks
    reduce_model="gpt-4o",   # Better model for final merge
    final_length=500,        # Target summary length in tokens
)

summary = summarizer.summarize(long_document_text)
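
The chunk → summarize → merge flow rests on overlapping chunking. A word-based sketch of the splitting step (the real splitter counts tokens, not words):

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into chunks where each chunk repeats the last `overlap`
    words of its predecessor, so context isn't lost at chunk boundaries."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text("word " * 5000, chunk_size=2000, overlap=200)
```

Each chunk is then summarized independently with the cheap map model, and the per-chunk summaries are concatenated and merged by the stronger reduce model into the final `final_length`-token summary.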

Configuration

# document_ai_config.yaml
pdf_parser:
  extract_images: true
  image_dpi: 300               # DPI for image extraction
  layout_analysis: true        # Detect columns, headers, footers
  password: null               # For encrypted PDFs

ocr:
  engine: "tesseract"          # tesseract | google_vision | aws_textract
  language: "eng"
  preprocessing:
    deskew: true               # Correct page rotation
    denoise: true              # Remove noise from scanned docs
    binarize: true             # Convert to black/white
  confidence_threshold: 0.6    # Below this, flag for manual review

table_extraction:
  detection_method: "lattice"  # lattice | stream | hybrid
  output_format: "dataframe"   # dataframe | csv | json
  merge_adjacent: true         # Merge tables split across pages

summarization:
  model: "gpt-4o-mini"
  strategy: "map_reduce"       # map_reduce | refine | stuff
  chunk_size: 2000
  chunk_overlap: 200
  max_summary_length: 300

extraction:
  model: "gpt-4o"
  confidence_threshold: 0.8   # Flag low-confidence extractions
  validate_types: true         # Enforce field type constraints
  schema_dir: "schemas/"

batch:
  max_workers: 4
  max_retries: 2
  output_format: "json"        # json | csv | parquet
  save_intermediate: false     # Save per-stage outputs for debugging
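
One common way to apply a file like this is to deep-merge it over built-in defaults, so users only specify the keys they want to change. A sketch, assuming the YAML has already been parsed into a dict (e.g. with PyYAML's `yaml.safe_load`):

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user config on defaults; user values win."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"ocr": {"engine": "tesseract", "language": "eng",
                    "confidence_threshold": 0.6}}
user = {"ocr": {"language": "deu"}}   # override just the OCR language
config = deep_merge(defaults, user)
```

Recursing into nested dicts matters here: a shallow `defaults | user` would replace the entire `ocr` section, silently dropping `engine` and `confidence_threshold`.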

Best Practices

  1. Pre-process scanned documents — Deskew, denoise, and binarize before OCR to dramatically improve text quality.
  2. Use the cheapest model for summarization chunks — Only use GPT-4o for the final merge step; gpt-4o-mini handles individual chunks well.
  3. Define schemas per document type — Generic extraction is weak. Dedicated schemas for invoices, contracts, and reports yield much higher accuracy.
  4. Validate extraction results — Always check confidence_threshold on extracted fields and route low-confidence items for human review.
  5. Process in batches, not one-by-one — The BatchProcessor handles parallelism, retries, and partial failures automatically.
  6. Keep OCR language packs minimal — Only install language packs you actually need to keep the deployment lightweight.
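
Practice 4 can be enforced with a small gate that splits extracted fields by confidence (field names here are hypothetical):

```python
REVIEW_THRESHOLD = 0.8

def triage(extractions: list[dict], threshold: float = REVIEW_THRESHOLD):
    """Route fields below the confidence threshold to a human-review queue."""
    accepted, review = [], []
    for field in extractions:
        (accepted if field["confidence"] >= threshold else review).append(field)
    return accepted, review

fields = [
    {"name": "vendor_name", "value": "Acme Corp", "confidence": 0.97},
    {"name": "total_amount", "value": 4250.0, "confidence": 0.55},
]
accepted, review = triage(fields)
```

Anything landing in the review queue can be written to a separate output directory or surfaced in a review UI rather than flowing silently into downstream systems.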

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| OCR produces garbled text | Poor scan quality or wrong language | Enable preprocessing (`deskew`, `denoise`) and verify the `language` setting |
| Table extraction misses tables | Tables use borderless/minimal styling | Switch `detection_method` to `stream` or `hybrid` |
| Summarization loses critical details | Chunk size too small; important info split across chunks | Increase `chunk_overlap` to 300+ and use the `refine` strategy |
| Structured extraction returns null fields | Schema field descriptions too vague | Add specific descriptions and example values to each `Field` |

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [Document AI Toolkit] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

